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ABSTRACT 


In this Phase I study funded under the Small Business Innovation Research (SB1R) program, 
statistical methods are developed using the predictive inference and entropy approach. Previous 
recent research has derived entropy as the natural measure of model approximation error from 
the fundamental statistical principles of sufficiency and repeated «ui»pifag In this study, the 
areas of nonnested multiple comparison, multivariable tiiM series analysis, adaptive ti«n« series 
analysis of changing proceeds, and optimal small sample inference are investigated. Constrained 
maaimum likelihood methods are developed for general nonnested multiple comparison. For the 
asymptotic optimality of these methods, a condition on the Fisher information and Hestian 
matrices must be satisfied. Applying these results to multivariate time series analysis, lower 
bounds ate derived for the achievable accuracy of the transfer function and spectral 

matrices, Markov and canonical variate analysis (CVA) provide a ■—— of numerically and sta¬ 
tistically stable model fitting of multivariable time series, and methods provide a basis for 
modeling and fitting time varying models of changing pcoccsses. ; J4ethods are derived for the 
optimal selection of data length for fitting slowly changing processors well as for optimal selec¬ 
tion of the data interval for detection of abrupt changes. Optimal smati minpie methods for mul¬ 
tivariate analysis are studied, and entropy methods are shown to provide Significant improvements 
in very small samples. Recommendations for Phase 0 research and d e velopment focus on the 
adaptive and nonadaptive time series analysis procedures developed in this study. 
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L INTRODUCTION AND OVERVIEW 


la this study, statistical methods are developed using predictive inference and entropy. This 
approach to statistical inference allows the treatment of several difficult s ta t istical problems that 
are not easily dealt with using traditional statistical methods. The particular sta t isti c al problems 
addressed are 

• statistical model building involving the determination of parametric 
model structure and order in the general case of multiple nonnested 
alternatives, 

• time series modeling and forecasting involving the determination of 
parametric model structure and order, 

• adaptive time series analysis involving optimal methods for tracking 
slow changes as well as for detecting abrupt changes or failures, 

• small sample inference for multivariate distributions of the exponen¬ 
tial family. 

A number of issues in these topics are resolved naturally in the predictive inference and entropy 
setting. This r e port provides an overview of the p rogr e s s of the Phase I research with detailed 
tarhwifi papers included in the Appendices. 

The recent interest in predictive distributions has come from several directions. Modem 
developments apparently begin with Jeffreys (1961, p.143) using a Bayesian approach as has much 
of the work following (Aitchison and Dunsmore, 1975, preface and p39). The frequentist 
viewpoint taken in this proposal has been stimulation by small sample problems (Murray, 1977, 
1979), model order and structure determination problems involving parametric models (Akaike, 
1973, 1974,), and nonnested multiple comparison problems (Larimore, 1977a, 1977b). Classical 
methods are conceptually ill-suited or perform poorly in practice on such problems. 

In the first Chapter, the approach using predictive inference and entropy is described. The 
basis of this approach is the derivation of entropy from the fundamental statistical principles of 
sufficiency and repeated —mpiiwg in the context of the predictive inference setup as first 
presented in Larimore (1963a). This provides a sound theoretical foundation that was previously 
lacking for the use of entropy as the natural measure of the error in approximating a true future 
density by an predictive density based upon a present sample. The generality of this 

entropy measure allows co m parison of statistical inference methods and the derivation of more 
optimal inference procedures including 
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• general inference method* such u parametric or nonparametric 

method* 

• exact evaluation of small sample procedure* 

• determination of model order or structure including the case of non¬ 
nested multiple comparison 

• rime aeries analysis including definition of optimal tracking of time 
varying p r oc e ss e s and optimal detection of abrupt changes. 

The generality of the predictive inference and entropy approach provides a basis for the generali¬ 
zation of the present statistical and predictive inference methods to more general statistical prob¬ 
lems. 

In Chapter 2, the multiple comparison of nonnested constrained models is developed. Previ¬ 
ous developments in multiple comparison have considered largely the nested case and assume that 
the true model is contained in one of the models. In the pr esen t study, the case of constrained 
i"«timiim likelihood estimation is considered where the true model may not be contained in any 
of the hypothesized models. The entropy measure provides a measure that allows the multiple 
comparison problem to be viewed as a model approximation problem. In this more general con¬ 
text the AIC procedure and generalizations of it are found to give asymptotically optimal predic¬ 
tive inference procedures as measured by entropy. In the nested case, these procedures reduce to 
the generalized likelihood ratio (GLR) test where the probability of rejection is a function of the 
number of additional parameters in the alternative model not contained in the null hypothesis. 

Time series analysis for stationary processes are considered in Chapter 4. The entropy meas¬ 
ure provides a direct interpretation of the achievable accuracy in estimation of the power spec¬ 
trum of a process. The entropy is expressed as a squared relative error in estimating the spec¬ 
trum. A generalization of this to multiple time series relates to principle components of the pro¬ 
cess cross spectral matrix. A lower bound is determined such that the expected integral of the 
squared relative error in spectral estimation is bounded by the number of estimated parameters 
divided by twice the sample tize. An example of spectral estimation of an ARMA(43) process 
using spectral smoothing, Autor e gr e ss i ve modeling, and ARMA modeling shows the relative error 
in these estimation methods as dependent on the number of estimated parameters. 

The topics of Markovian time series or state space models provides an approach to time 
series analysis that is readily computable and is easily extended to the case of changing pro ces s es . 
The general state space model form is developed for Markov pro ces s es . The canonical variate 
analysis (CVA) method gives a direct and numerically stable computational method for determin¬ 
ing stats space models from observational data. The basic computational method is the general¬ 
ized singular value decompontion (SVD). This method allows for the direct determination of the 
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optimal model state order without the computationally intensive fitting of such models for the 
evaluation of model fit. Once the model state order is determined, the state space model coeffi¬ 
cients are simply computed by regression. This method generalizes easily to changing processes. 

Adaptive time series methods are developed in Chapter S. Primarily two types of changes 
are considered, slowly varying changes and abrupt changes such as faults. Time varying Markov 
pr oc es s e s are developed for such changing p roc e ss e s. Such pro ces ses provide the hypothesized 
models for developing optimal tracking and detection of abrupt changes. An A1C based pro¬ 
cedure is derived for the near optimal selection of the data length to use in model fitting. An 
example is given of estimating the spectrum of a time varying proc e sses that gives results near the 
best previous solutions that are much more specialized. For abrupt change detection, a generali¬ 
zation of the A1C procedure is required since the comparison of models fitted on different data 
intervals is required which is not considered in the A1C formulation. Application of these 
methods to simulated data of abrupt changes in an ARMA(4,3) pro ce sses including jumps in the 
I state, changes in the dynamics, and change in the variance of the excitation noise p ro ce ss e s, 

demonstrates that the procedure is sensitive to the detection of these very different types of 
abrupt changes. 


Small sample multivariate inference procedures are described in Chapter 6. Since the 
entropy measure gives an exact rational measure of the relative error of statistical inference pro¬ 
cedures in «"»*! samples, it provides the bases for evaluation and development of small sample 
inference methods. The historical approach to predictive inference involves the derivation of a 
Bayesian predictive density. Although the method is Bayesian, in certain instances, the resultant 
predictive density has certain invariance properties which lead to an optimal predictive density in 
terms of the entropy measure. Another approach involves the direct solution for the optimal 
invariant predictive density minimizing the entropy measure. This optimal invariant procedure 
leads to the same predictive density as the Bayesian predictive density using a noninformative 
prior. These methods are compared for the multivariate normal distribution with the estimative 
and best normal estimation procedures. 









2. APPROACH USING PREDICTIVE INFERENCE AND ENTROPY 


The coocepts of prediction and inference baaed on a aet of data are very old and underlie 
much of the scientific method. While the scientific method has been much discussed in philo¬ 
sophical and qualitative terms, there has been very little in the literature from a basic statistical 
viewpoint. The mom extensive literature appears to be that associated with predictive densities or 
predictive distributions (see Aitchison and Dunsmore, 1975). That approach is to a large degree 
Bayesian, although more recent treatments have developed a purely frequency sampling interpre¬ 


tation in connection with use of entropy or Kullback information. The weak point in the fre¬ 
quency approach was the seemingly arbitrary use of the entropy measure of model approximation 
error. More recently, however, the result of Larimore (1983a) has established the fundamental 
nature of the entropy measure based upon the statistical principles of sufficiency and repeated 
mn pling The entropy measure has in addition a very natural interpretation as the log relative 
odds in comparing two predictive densities in predicting the future sample. This gives a central 
role to the entropy measure. The use of the entropy measure for decisions on model order and 
structure was pioneered by Akaike (1973), and has been applied to many diverse statistical prob¬ 
lems particularly in time series analysis. The justification given to the entropy measure in the 
Akaike approach, however, has been largely heuristic. Because of the importance of the justifi¬ 
cation of the entropy measure, the derivation and important concepts are outlined below in Sec¬ 
tion 2.1. The use of entropy in comparing model structure selection procedures and for exact 
small sample inference is discussed in Section 22. This approach to developing statistical pro¬ 
cedures using predictive inference and entropy is then applied to the various topics in the follow¬ 
ing chapters of the r e p ort. 


2JI Dartvatfsa of the Entropy Measure of 

Predictive inference involves an experimental situation with two trials, an informative trial 
with observations x and a predictive trial with observations y. The joint distribution of the two 
trials is permitted any statistical dependence and is described by a joint probability density p(x j). 
The objective is to choose a predictive distribution or density which, for each posable observed x , 
is a probability density for the future outcome y. More precisely, consider a family p(y\x,a) of 
predictive densities where the index a specifies a particular predictive distribution. For a particu¬ 
lar choice of a, p(y\x# i) can be viewed as a conditional probability density of y given x for 
predicting the distribution of the future observation y given an observed value of x. The predic¬ 
tive inference problem involves selection of a criterion of fit for appraising the goodness of 
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approximation to the true conditional probability density p(y| x) by the various predictive densi¬ 
ties p(y j x,a) specified by different a. The choice of such a criterion of fit is the primary topic of 
this section. Negative entropy is derived as the natural measure of model approximation error for 
any predictive distribution. 

Consider a family C - { pjy\ x), a 64} of predictive densities for approximating the true 
density p.(y\ x) of the predictive experiment y given the informative experiment x, where x and y 
are vectors of dimension K and L respectively with true joint density p,(x,y). For the predictive 
inference problem, a relative measure of goodness of approximation of p.(y| x) by the various 
pjy | x) is desired. To this end, a repeated sampling experiment is considered in which joint ran¬ 
dom samples (x,^) for i are drawn repeatedly from a population with density p. (*o0- 

The probability density of the joint predictive experiments Y ~(y x . y N ) predicted by the a*th 

model using X * (xj. x M ) is 

PaO'lJn-ftp.foU) (2-1) 

/-I 


The probability density for Y can be considered as indexed by the pair (a^f). Statistical infer¬ 
ence is considered about the true density p.(Y | X) of Y from among the family of probability den¬ 
sities F - { p m (Y\ X), a(A} for a fixed X . 

To consider the essential statistical information about the future sample Y given by the 
predictive densities p a (rj X), the sufficiency of the likelihood function (Zacks, 1971, p. 61) is 
used. From this principle, any inferences about the family F drawn on the basis of the sample 
(r| X) follow from the observed values of the likelihood function p a (Y\X) tot a (A. The set of 
likelihood ratios formed from pain of these likelihoods is also a sufficient statistic (Cox and Hink- 
ley, 1974, p. 20-1, see also p. 37-9 for a discussion of likelihood and sufficiency principles). 

For inference about the densities p x and p 2 , *0 of the information is contained in the likeli¬ 
hood ratio 


* Pi(y<l*<) 

n „_/v.i 


,V, Plbi\*t) 


which has the intuitive interpretation of the relative odds of observing the data Y of the repeated 
predictive trials from each of the distributions p x and p 2 given fixed informative data X. The 
behavior of A* as the number of repetitions becomes large is most easily seen by expressing it as 


tl 1 , . , „ .. 1 £ . Pi(y t \*i) 

Hz * - Jm -X 


'j.'AWAV A'.v.v”.v.\ 
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- Ifp*(y\ *) dy p.(x) dx (2-3) 

For a Urge number of repetitions, the odds will overwhelmingly favor Pi or p 2 if the limit is 
respectively strictly positive of strictly negative. The preference for one distribution over the 
other as expressed by the likelihood ratio tends to grow exponentially with the number N of 
repeated trials. If (2-3) is zero, then there will be no consistent preference with large numbers of 
trials. Although for a finite number N of repetitions the likelihood ratio A w depends upon the 
particular samples (X,Y), asymptotically for large numbers of repetitions this dependence disap* 
pears. 

The direct pairwise comparison of predictive densities is not necessary if the Kullback- 
Leibier conditional discrimination information (Kullback and Leibler, 1931; Kullback, 1939, p. 13) 

f p*(y\ x ) 

Iy\ i(P m J>j - fp>(y\ *) lp g ^) dy (2-4) 

of p a relative to p, is used which is a function of x. Note that the order of p a and p. are not 
interchangeable with the latter playing the role of the truth. 

The likelihood ratio (2-3) is expressed in terms of the Kullback information as 

Jj® jf lo * A* - E,{/,| x (p. fid - lj\ m(p<J> i)} (2-5) 

where E M denotes expectation with respect to the true density p,(x). The criterion is thus deter¬ 
mined as the negative entropy, or negentropy for brevity, defined as 

, e P.(yU) 

- B M /^ t (p.jfJ - fp,(x)dx fp.(y\ x) l°g p dy (2-6) 

the expected Kullback conditional information of the predictive density relative to the true condi¬ 
tional density p,(y\x). In the repeated sampling experiment, the predictive density with the 
smaller negentropy relative to the true is ultimately preferred. The negentropy (2-6) thus orders 
the goodn es s of a set of predictive densities in approximating the true density. Also in comparing 
any two predictive densities p x and p 2 , *be respective difference has the intuitive interpretation as 
the exponential rate at which the likelihood ratio diverges. 

The above derivation of the entropy measure of approximation of a predictive density uses 
only the predictive inference setup in the repeated sampling context along with the principle of 
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sufficiency. The sufficiency principle is one of the few generally accepted principles in statistical 
inference. Various repeated sampling principles have been formulated, however the difficulty has 
been the choice of an evaluation criterion for comparing various sampling distributions. The 
entropy measure gives a criterion that is based upon basic statistical principles of inference. 

2J Use Use of Entropy in Statistical Inference 

The entropy measure of error in approximating a predictive density is very general and can 
be applied in diverse modeling problems. In this section, some of the general model selection 
problems are described which include the nonnested multiple comparison problem, adaptive time 
series analysis of changing processes, and optimal small multivariate methods. 

From the derivation of the entropy measure, it can be seen that the entropy measure has a 
number of .very attractive features: 

• It applies to completely general modeling problems including non- 
parametric methods. 

• It applies exactly to small samples. 

• Only the fundamental statistical principles of sufficiency and 
repeated sampling are used. 

• It applies to time correlated problems such as time series model 
identification and tracking. 

• Statistical inference can be fundamentally viewed as model approxi¬ 
mation. 

Note also that the predictive distribution can include an entire model structure- 
determination/parameter-estimation scheme by setting 

Pk<y\ *) ■ p(y )) (2-7) 

where for every x, k{x) is the * minimizing a model structure determination criterion. Thus for 
each a, p a can be regarded as a model fitting procedure including the choice k(x) of model struc¬ 
ture. 





The negentropy measure is entirely applicable to exact small sample inference, system iden¬ 
tification, and detection of abrupt changes that include decisions among a multitude of 
parametric model structures which may be nonnested. Several predictive inference problems 
have been considered in the literature. Previous work has used the negative entropy measure in 
much more restricted formulations where 


£ 

S3 

1 
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• the informative and predictive samples were assumed to be indepen¬ 
dent (Aitchison and Dunsmore (1975), Akaike (1973)) which does 
not include the time series forecasting problem 

• the use of the negative entropy measure was considered as arbitrary 
(Aitchison (1975), Murray 1979)) or justified only asymptotically for 
large samples by heuristic arguments (Akaike (1973)) 

• the negative entropy measure was justified as a basis for comparing 
distinct parametric model structures (Akaike (1973)), but not for 
comparing model selection procedures which include choice of the 
model structure (Larimore, 1983a) 

• only the case of nested structures such as autor e gr e ssive models 
were justified (Akaike (1973)) although there has been wide spread 
application of it to the general nonnested case such as ARMA 
models 

• the previous literature on the use of information theory in statistical 
inference justifies its use by arguments of information transmission, 
a set of postulates supposed to be obvious, or by analogy with 
entropy in statistical mechanics none of which are convincing from 
the point of view of statistical inference (Kendall (1973), Hart 
(1971)). 

Thus the results of Larimore (1983a) give a solid theoretical justification for the use of the nega¬ 
tive entropy measure in a general setting which makes possible the further general development 
of predictive inference statistical methods. 

For the parametric case, the very general considerations above simplify somewhat. For the 
structure determination problems, an estimator of the form as in (2-7) associates a parameter esti¬ 
mate t(x) with each possible value a of the sample space. To simplify the discussion in this sec¬ 
tion, we can consider that the informative experiment x (fit set) and predictive experiment y 
(check set) are independently and identically distributed K-dimensioual vectors. In Section 4, the 
general dependent time series analyse case will be discussed. We predict the density of y by 
p{yMx)) where p(y,t) is the parameteriaed class of densities for y. The negative entropy meas¬ 
ure (2-6) radacas in the parametnc case to that suggsstart by Akaike (1973) and is expressible as 

«<p. J) - £,*<*./) - E, / p(y, t.) tog- * 0 ? 0 dy (2-8) 

p(y .•(*)) 

where * daaoaee the true value of (he parameter • and where E, denotes expectation with 
re spect to the informative sample - fbat the estimator i(r ) may involve different model orders 
or structures as in (2-7) is ao conceptual difficulty although it may complicate the evaluation of 
the negsatropy (2-6). The predictive tampie , icbeck set) is never actually drawn, but we wish to 
devise da em o n procedures which would optimally predict in terms of (2-6). 
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The major problem is to devise model-estimation/structure-determination schemes 

which come cloee to mimmiring the negentropy. A major step in that direction was made by 
Akaike (1973) in proporing an extension of the maximum likelihood method to compare different 
model orders or structures. Suppose that 0*00 is the maximum likelihood estimator for a given 
restriction of the parameters 0 to a subspace H k that is defined for every x in the sample space. 
Then we wish to partition the sample space into the disjoint subsets X k so that for x (X , the esti- 


e t - U ) = 0*CO for x(X k (2-9) 

is used. Akaike (1973) shows that asymptotically for nested models, an unbiased estimate of the 
negentropy using the maximum likelihood model 0*00 for the whole sample space x(X is given 
by the Akaike information criterion (AIC) defined by 

AIC(k) - -21n p(x ,0*00) + 2*00 (2-10) 

where K(k) is the dimension of H k , i.e., the number of parameters estimated. The Minimum 
AIC Estimate (MA1CE) proposed by Akaike (1973,1974b) is to partition the sample space so that 
X k is the set of sample points for which 


Then the MAICE estimate is 


AIC(k) < AIC(j) for j = k 


&maice( x ) ~ ®*0O f OT *0f* 


TO* 


so that on the set X k , 6 yA j CE (x) is the muimiim likelihood estimate 0*00 corresponding to the 
model structure k with minimum AIC. 

For autoregressive models, Shibata (1981a) has studied the MAICE and other asymptotically 
equivalent procedures for model-estimation/order-determination. He adopted a spectral measure 
of accuracy that is asymptotically equivalent to the negentropy. He showed that asymptotically 
for large sample, MAICE minimi*** the negentropy measure of accuracy (2-8), which will be 
called entropy efficiency. Hence MAICE is asymptotically an optimal procedure for choosing 
autor eg r e ss i ve models. Shibata (1981b) also shows MAICE as asymptotically optimal for regres¬ 
sion problems which involve nonnested multiple comparisons. Other procedures for model order 
determination have been proposed (Bhansali and Downham, 1977; Schwarz, 1978) which 
emphasize the choice of true model order asymptotically for large samples, which is called order 
consistency. In most real problems the true order is infinite, and even if such a fiction were to 
exist, a predictive criterion is much more intuitive in most applications. Shibata (1983) has shown 
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that order consistency aad entropy efficiency are mutually exclusive so that a choice is required 
as to which of these criteria is most important. In particular, Shabata has shown that an order 
consistent procedure cannot be entropy efficient, and that an entropy efficient procedure will not 
be order consistent. 

Turing now to a different general problem, that of exact small sample inference, predictive 
inference and entropy provides a new approach to the problem. Past approaches to the 
samp: tference problem have involved a number at ad hoc procedures. The entropy measure 
provides a sound fundamental measure of the approximation error in predicting the density of the 
future experiment. One of the past approaches has involved the estiimuivt method where the 
predictive density is restricted to lie in the class of densities to contain the true. Recent 

results have shown that the use of more general predictive densities can give more optimal results 
as measured in terms of the entropy measure of model approximation error (Murray, 1979, 1977; 
Aitchison, 1975). The optimal predictive density has been derived in the class of invariant densi¬ 
ties mininifoing the entropy measure. This was derived before the justification of the entropy 
measure based upon the sufficiency principle. As shown in Chapter 6, this more general and 
optimal predictive density can be considerably better as measured by the entropy. 

In time series analysis, the advantage of the approach using predictive inference and entropy 
is that it provides a sound theoretical framework in terms of model approximation for the direct 
comparison of very general time series analysis models including: 

• Consideration of many complex hypotheses 

• Comparison of nonnested hypotheses 

a Comparison of dynamic models of different dynamic (state) orders 

a Consideration of models fitted over different data sets for detecting 
abrupt changes 

a Consideration of different adaptation rates for doing optimal model 
tracking 

The comparison of such diverse models is inherent in adaptive time series analysis and abrupt 
change detection, and previous investigations have not had available such a sound and general 
framework for solving these difficult problems. 









3 . CONTAINED NONNESTED MULTIPLE COMPARISON OF MODELS 


t 

< 

la this chapter, the general problem of nonnested multiple comparison is considered. In j 

order to develop a general theory, consideration is restricted to comparing models that are the 
result of constrained maaimum likelihood estimation. The objective of the discussion is to gen* 
eraliae the currently available procedures for nonnested multiple comparison in the constrained 
m««im.Mi likelihood context. 

The approach is to view the fitting of each alternative parametric model form as an approxi¬ 
mation procedure, which includes the notion that the true model is in general not contained in 
the dam of parametric models considered in the model fitting. This is a departure from previous 
approaches that involve primarily asymptotic arguments where the parametric models approach 
the true model as the —"T** aim becomes large. Such an asymptotic argument begs the question 
of model approximation since asymptotically there is no error in the approximation. It is very 
important in practice to determine the extent to which the asy mp to ti c approximations are accu¬ 
rate in moderate or small samples. 

Another area of weakness in available approaches is the assumption of nesting in comparing 
models using entropy methods. The derivations of Akaike involve the assumption of nested 
models which considerably restricts the applications of the methods. In practice, the AIC cri¬ 
terion has been applied in a much wider context than the comparison of nested models. 

The results of this chapter will provide the basis of much more general decision procedures 
for adaptive time aeries analysis. In these problems, the comparison of different models based 
upon different intervals of data are compared to determine if an abrupt change has occurred. 

Previous entropy methods have only compared different models for a given interval of date. 

The first remit to be discussed is the generalization of the usual maximum likelihood theory 
to the constrained case. The regular case is considered where the log likelihood function is 
expandable in a Taylor series (Cox and Hinkley, 1974, p. 281). These conditions permit the inter¬ 
change of expectation and differentiation. 

Following the notation of Larimore (1986d, contained in Appendix A), let /(* ,•) denote the 
log likelihood function of the informative sample * considered as a function of the parameters 0. 

Denote by S the expectation with respect to the true density p(x ,0) with true parameter 0. The 
negative entropy measure is used as the measure of approximation to the true density by an 









approximating density p(x,0*) with parameter value 0* from the subspace 0 4 of parameten. The 
projection & at § onto the s u b apa c e 0 4 is defined as the parameter value 0* minimizing 

-£/(*,e)-£/(*,•*) (3-1) 

so that the projection 0* satisfies the condition 

El (x#)- 0, (3.2) 

where ' denotes the derivative with respect to 0*. The minimum is unique if and only if the Hes¬ 
sian D% given by D, ■ El "(x ,0* ) is positive definite. Thus for a constrained class of models, the 
projection of the true parameter value defines the best approximation to the true density in the 
class of approximating densities. 

Consider now the constrained maximum likelihood estimate 0* in the subapace of parame¬ 
ten 0| satisfying the likelihood equation 

r(x#)-0 (3-3) 

Then under the regularity conditions, we have for a positive definite Hessian D* and asymptoti¬ 
cally for large informative sample x that 

• 0* is an unbiased estimator of 0* 

• the estimation error covariance matrix is 

£(0 l -0 l X0 l -« l ) r - (Dfr'E {/ r (*,0*)/'(*,0*)KD^)* 1 (34) 

For the unconstrained case, the middle term is the Fisher information matrix and is equal to 
minus the Hessian D%, but in the general constrained case this is not true. 

Now consider the likelihood /(y{x,0) of the predictive experiment y conditioned on the 
informative experiment x. From the above results, the negentropy can be easily determined. 
Fjcpanding the log likelihood to second order and taking expectation gives the negative entropy as 

+**,(M*) (3-5) 

which holds asymptotically for large informative sample. Note that the second term is exact with 
no approximatkxi in small samples. The first term is an approximation involving the variation of 
the maximum likelihood estimate locally about the projection 0*. Thus the bias pan of the error 
in constraining the model is captured exactly. 
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3J lJuhtesd rrtmsHis «f Brtrapy 

la the previous discusrioo of entropy, the measure is considered as a measure of approxima¬ 


tion error between the true and approximating density. In practice, the true density is unknown 
and it is desired to obtain an estimate of the negative entropy based upon the observed informa¬ 
tive sample. To simplify the discussion, the case of x and y independent is considered. An accu¬ 
rate estimate of the negative entropy was first obtained by Akaike using the log likelihood as an 
estimate of the entropy with a correction for the bias. The Akaike information criterion (A1C) 


A/C(k) - -21ogp(x,0*(x)) + 2X(k) 


( 3 - 6 ) 


was derived as an unbiased estimate of the entropy where K(k) is the number of parameters 
adjusted in fitting the maximum likelihood estimates. The second term adjusts for the bias in 
estimating the entropy using the informative sample and adjusting the parameters in fitting. 
Akaike (1973) originally derived the A1C as an unbiased estimate for the relative comparison of 
the prediction error in comparing two nested models. The newing is »i«n important in >*«■» 
derivation because the models are not only nested but asymptotically approach the true model. 

In the more general case of constrained maximum likelihood estimation, a difficulty occurs 
in the estimation of the negentropy. Consider as above the case of x and y independent and 
identically distributed. As derived in Appendix A, the expected log likelihood difference of the 
informative sample is 


£[/(x,0) - /(x,0*)J =. -tr(Dfr l E{l r (x,0*)/(x,0»)} + £,(0,0*) 


(3-7) 


In the unconstrained case, asymptotically for large informative sample the trace term is equal to 
the number of parameters estimated. Unfortunately in the general constrained case, the Hesrian 
is not equal to the Fisher information matrix. The trace then depends upon the expectation with 
respect to the true unknown density of the fust and second derivatives of the log likelihood func¬ 
tion at the projection 0*. This cannot be computed in general since the true parameter is unk¬ 
nown. 


In cases where the Hessian and Fisher information are equal so that the trace is equal to the 
number of parameters, then two different parametric model structures, say 0* and 0* can be com¬ 
pared using (3-7) as 


£(f(x,0*) - /(xiOj - * dm(&) - dim(9>) + £(0,00 - £(0,0*) 


(34) 


which is equivalent to the A1C. The derivation for constrained mnimnm likelihood estimates 
makes dear some of the asnunptioos in previous entropy methods. The major difficulty is caused 
by the variation of the Fisher information or Hesrian as the parameter values change. In the 
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derivation above, the issue of netting does not trim. 


MNMMi 

In the case of netted testa, the MAJCE criterion reduces to the usual generalized likelihood 
ratio (GLR) teat. The threshold and resulting probability of rejecting the null hypothesis, in. the 
aias of the test, depends upon the number of additional parameters in the more general model. 

la comparing two hypotheses H 0 and H x , the AIC criterion is to choose according to the 
sign of the quantity 

AIC (Ho) -AICiHJ- -2log^~* + 2[JT(0) - *(1)J (3-9) 

p(*,r) 

The AIC criterion in the nested case is equivalent to the decision rule 
chocs* Ho if -21ogX < 2(JT(1) - JC(0)) 

chaost H , if -2logX 2 2|*(1) - JC(0» (3-10) 


where the gmtaralisad likelihood ratio X is defined by 

X , fifj* 1 ) 


(3-H) 


The thresboid 2[X(1) - JT(0)] is precisely twice the number of additional parameters under the 
hypothesis H 



In the case of a normal class of densities, the size a of the test is easily determined since the 
GLR statistic X is chi-equarod on K( 1) - *(0) degrees of freedom under the null hypothesis H 0 - 
Thus a is given as the solution of the relationship 




(3-12) 


were « is the number of additional parameters and X^ is the a probability point of the cumula¬ 
tive chi-squared distribution on « degrees of f reedom. Solving for a as a function of * gives the 
size of the test as a function of the number a of additional parameters as shown in Table 1. Note 
that the traditional a levels of 0.10, 0.05. 0.01, 0.005 and 0.001 corr es pond respectively to about 4, 
8,16, 20, and 30 additional parameters It has been known for a long time that in composite tests 
or repeated appli cati ons of simple additions of a single variable that the significance level is con¬ 
siderably reduced such as in step-wise regression. The entropy approach makes explicit the 
change in the size a of the test with the number of additional estimated parameters. 


-SI 










TabU 1. Dependence of Significance Level a on the Number n of Additional Parameter! 










4. TIME SERIES ANALYSIS USING ENTROPY METHODS 


Methods of predictive inference and entropy offer a number of advantages in the analysis of 
rim* aeries not available in other methods. In this chapter the basic time series analysis methods 
are d es cribed, while in the following chapter adaptive methods for time series analysis are 
developed using predictive inference and entropy. First the topic of the achievable accuracy of 
spectral analysis is addressed by relating the entropy measure directly to a relative squared error 
in estimating the power spectrum. Following this is a discussion of the Markovian representation 
of time series in terms of state space models which will be very useful in representing time vary* 
inf models of time series. The canonical variate analysis approach to rime series is then described 
which forms the basis of the adaptive time series analysis methods developed in the following 
chapter. 

4J Achievable S pec t ral Accuracy 

In this section, the informative and predictive samples will be denoted by a and v respec¬ 
tively to allow for the traditional use of x and y for random processes. Consider the problem of 
identifying a model for a pair of multiple stationary time series where x, and y, are exo¬ 
genous and endogenous time series respectively. Consider linear stochastic models in the form of 
a linear difference equation 

* - " ft + r » C 4 -!) 

H) 

where h(t ;S) is a causal linear system giving the response in y, to the past exogenous inputs x,, 
and where q, is white noise. Suppose that the probability density of the pro cess is parameterized 
by f. The exogenous variable x, will be considered as exactly observed, and the problem of 
modeling y, conditional on x t is considered so that prediction of y, from x, based upon such a 
model is the princip le problem. This also includes the problem of no exogenous variable so that 
only y, is observed. 

We wish to investigate the achievable accuracy in estimating a model for the process. In 
particular, the entropy measure will be developed to obtain the relationship between the number 
of parameters estimated in the model and the relative squared error in estimating the power spec¬ 
trum. An example illustrates the effect on the spectral estimation error due to the particular 
dam of parametric m od el s used in the ideotificatioo and the number of parameters estimated. 
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The predictive inference setup of Chapter 2 is considered where the primary interest is in 
the a sy mp totic behavior with large sample size of both the informative and predictive 
Consider an obs e rved informative sample u T -(xf . - • • ^Ivl) of sis N used to estimate the 
p roc es s model, and similarly consider a conceptual predictive sample v of size M used to evaluate 
the accuracy of the estimated model. The predictive sample is assumed to be identically distri¬ 
buted but independent of the informative sample. Consider the problem of inference on the 
parametric class {p(v,•),•&} of models with probability densities p(r,6) based upon the informa¬ 
tive sample u. Consider the conceptual repeated experiment where on each trial the 

samples m and v ate each drawn independently from the process 5 («,•) with • ■— to be the 
true parameter value. An estimative model J-p(v.$(**)) is chosen for the density of v by some 
parameter estimation scheme 6(e). For a stationary p ro ces s, the negative entropy (2-6) is linear in 
the predictive sample size U , so it is more useful to consider the per sample n ege ntropy. To this 
end, define the per sample negentropy denoted As derived in Appendix B, the I- 

divergence is given asymptotically by 

nsJ) - 4, 

p(y Mft)) 

- j e. 


*tI. - ri(»)is„(.x«(.) - tf<„) 


2ir 


(4-2) 


where E m d en ot es expectation relative to the informative sample it. 

In the multiple time series case, the spectral measure (4-2) has an intuitive interpretation in 
terms of principal components of the power spectrum in the frequency domain. Principle com¬ 
ponent re pr esen ta t ions of the spectral matrices 5„(«) and 5 a (w) have the form 


H") *„(«-) /*(-) - D(«) , L(m) 5„(«) £*(«) - £(«) 


(4-3) 


where /(«•) and L(m) given u a function of frequency <o are unitary matrix transformatioas so 
J (m )/' (w)W (w)L * («*) which diagonalize (w) and S w («) respectively and where £>(«) and 
£(«*) are d iag on al matricies. Ritering x(r) with transfer function L(«) gives the principal com¬ 
ponent process x(t) which is expressed in the frequency domain u X(w)-L(m)Y(») , and which 
has the diagnnal spectral matrix £(»), and similarly for q(t). 

The spectral measure (4-2) is shown in Appendix B to be 
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(4-4) 


The fine Mini on the right hand ride is the integrated squared relative error of the estimated cos¬ 
pectra of the principal components, while the second term is the integrated squared coherency of 
the ewimated spectrum D(m) which would be zero if D(w)-D(w). Thus the measure (4-2) has a 
dear interpretation in the multivariate case when the true spectrum D(u) is diagonal but where 
the apprarimating spectrum D(m) is permitted arbitrary coherency among components. The third 
term in the spectral measure (4-4) is asymptotically equivalent to replacing S„(«) by s„(«). This 
term is invariant to the unitary transformations /(«*) and L(m) where G («)-/( M W(«)£.*(«) is the 
transfer function H (w) expressed in the coordinate frame of the principal component series x(i) 
and y(i). The squared magnitude error | G tJ (u) - G y («)| 2 in the ij element of the transfer func¬ 
tion is weighted by the input signal to output noise ratio D(«*) h /£(w) // for the pair (ij). 

The spectral measure of accuracy can be bounded in terms of the number of parameters 
estimated. Suppose first for simplicity that the parametric class of models contains the true pro¬ 
cess and that k parameters are estimated. Then by Appendix B using the Cramer-Rao lower 
bound, the per sample negentropy is bounded by 


E m l(Sj ) - E m Jjj(9 - •)*>(• - •) * 2^ nF ' F - ^ 


(4-5) 


with equality achieved asymptotically for large informative sample N. This implies the bound on 
the achievable accuracy in spectral estimation given by 






(4-6) 


In the more general case where the order is infinite and the MA1CE procedure for choosing 
model order k is used, then Shibeu (1983) has shown the following result. For each informative 
sample size N there is an optimal order k'(N) which minimizes the tradeoff between variability 
and bias in the entropy measure 


*(•.«*) - - Til •*-** II l + *(•>•*) " + *(M*) 


(4-7) 











Then asymptotically for larfe informative sample N, the negative entropy of the MAICE model 
selecti on p roc e dure is exactly that of the model selection procedure using a fixed number of 
parameters equal to the optimal order *\ Thus even in the case of using an entropy efficient 
model selection procedure where the true model order is infinite, the achievable accuracy of 
spectral estimation (4-6) can be bounded by the function (4-7) of the sample size N with k -4*. 

To illustrate the use of the lower bound on the achievable accuracy, consider the ARMA 
(43) process 

y, - 13136 y,_ x - 1.4401 y,_ 2 + 1.0919 y,_ 3 - 033527 y,^ 

+ tv, + 0.17921 tv,_, + 032020 + 026764 w,_ x (43) 

with the ndise variance ofwuQ ■ 1.7258 xlO -2 . This process was analyzed by Gersch and Sharp 
(1973) and Akaike (1974b) to show the increased accuracy of ARMA models over AR models. 
For a sample size of 800, the optimal order was found to correspo n d to k' ■ 18 for fitting an AR 
model to the data. Akaike (1974b) fitted several models to «wiuI»k»h data using AR, ARMA, 
and Hanning window methods. Figure 1 shows the variability term of the spectral error as a 
function of frequency for the various model fitting procedures. Since the optimal order was used 
for the AR model, the bias is also included. The ARMA model has no bias nnc-. the full order is 
chosen with high probability. On the ocher hand, the Hanning window has significant bias since a 
fixed bandwidth is used to spectrally smooth the data at all frequencies, and this bandwidth is not 
sufficient to estimate the sharp peak without bias. Use of a wider bandwidth would increase the 
already larfe error at all the other frequencies. The AR and ARMA methods are clearly adap¬ 
tive in that the methods have a greater bandwidth to accommodate the rapid changes (large 
second derivative or curvature) near the peaks and troughs but lower bandwidth in regions with 
low curvature. The greater parametric efficiency of the ARMA method is clearly depicted in 
these regions of low curvature. Repeated simulation of the time series data from the ARMA(43) 
model and the maximum likelihood estimation of ARMA models with MAICE confirms that the 
lower bound i ndee d gives an accurate description of the spectral estimation error in practice (Lar- 
imote, Mahmood, and Mehra, 1984). Independent methods have been developed for obtaining 
simultaneous confidence bands on spectral estimates (Larimore, 1986c). Such confidence bands 
are proportional to the integrand of the spectral measure of accuracy so that the integrand gives 
an accurate measure of spectral error at each frequency as well. 
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Figure 1. Spectral Accuracy of Different Model Fitting Procedures Gower curves) in Approxi¬ 
mating the True Spectral Density (upper curve). 

< 

4 J Markovian Models of The* Scries 

In this section Markovian or state space models of time series are reviewed. Such models 3 

have not been widely used in time series analysis, although there is wide spread use of such \ 

models in filtering and prediction with numerous applications in engineering. State space models 
have a number of advantages in time series analysis that are attractive for automatic implementa- jj 

tkm on mi cr opro ce ssors using the canonical variate analysis method discussed in the next section. j 

Such procedures allow the automatic selection of model state order using entropy methods and 
lend themselves to adaptive methods for time varying p ro ces s es discussed in the next Chapter. 

The starting point of any approach is the joint probability distribution of the past and future 
observations ,®) where p, are the past inputs it, and outputs y, up to time t and/, are the 
outputs y, in the future at time r defined by 
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pj * ( • • • . /. r = ■■) (4-9) 

and 8 is a vector of parameters indexing the model. A fundamental property of a Markov process 
of finite sate order is the existence of a finite dimensional sate x, which is a linear function 

*, - Cp, (4-10) 

of the past p,. The sate x, has the property that the distribution of the future f, conditioned on 
the past p, is identical to that of the future f, conditioned on the finite dimensional sate x, so 

(4-11) 

Thus, only a finite amount of information from the past is relevant to the future evolution of the 
process. 

A stationary Markov process of some particular sate order can be represented by a vector 
difference equation with the general form (Lindquist and Pavon, 1981) 

x, +1 = <fcr, +Gu, +w, (4-12) 

y, - Hx, +Au, + Bw, + v, (4-13) 

where u is an input vector process, y is the output vector, x is the sate vector, and w and v are 
white noise processes that are independent with covariance matrices Q and R respectively. The 

matrices «h, A, B, G, and H determine the dynamics of the process and correlational characteris¬ 

tics of the disturbances. The various matrices are considered as functions of the parameters 8 
specifying the process. The white noise proc e sses model the covariance structure of the error in 
predicting y from u. 

For time series analysis and system identification, the parameterization of the model is an 
important issue. The elements of all of the matrices of the sate space model (4-12) and (4-13) 
and noise covariances are not independent parameters of the model. In fact for each distinct pro¬ 
bability distribution there is an equivalence class of models of the form (4-12) and (4-13) with the 
same distribution. It can be shown (Candy, Bullock, and Warren, 1979) that the number of 
independent parameters is 

K (k )=2 kn +n(n +1) 2 +km +nm (4-14) 

where k, n, and m are the vector dimensions of the sate x, , outputs y, , and inputs u, respectively. 
If there is no instantaneous feedforward so A =0, then the term nm is deleted, while if there is no 
input so the terms km+nm are deleted. 


















4-7 

The st ate sp wr* parameterization (4-12) and (4-13) is not unique, however in the next section 
a well conditioned procedure for selecting a unique model for the equivalence class will be 
described. For an individual state space model there exists a corr es ponding ARMA model and 
via vena. However the two classes are not equivalent as clasws. In general there is no ARMA 
class of models equivalent of a particular state space class. The ARMA class has one major diffi¬ 
culty • there is no global parameterization of the state space models of a given order. The diffi¬ 
culty is in the ARMA representation which becomes singular at certain models such as one 
involving the cancellation of a pole and a zero. This causes great difficulty in numerical methods 
in attempting to automatically identify higher order models which may involve such cancellations 
of poles and zeros. 

The major advantage of the state space models is the availability of efficient and numeri¬ 
cally well conditioned procedures for model identification discusKd in the next section, and the 
explicit Markov structure allows for the the development of direct adaptation procedures 
developed in the next chapter. 

43 Canonical Variate Analyse of Time Scries 

The canonical variate analysis method for identification of state space time series models is 
described in this section. The methods for the determination of the state order and selection of 
the state using concepts of canonical variate analysis are first discussed. The determination of the 
state space model is then computed by simple regression. The computation involves primarily a 
singular value decomposition of the sample covariance matrix of the process. 

A generalization of the canonical variate analysis method has recently provided a completely 
general solution to the static reduced rank stochastic prediction problem which is well defined 
statistically and computationally even when some or all of the various covariance matrices are 
singular (Larimore, 1986b). All other previous methods in the statistical literature do not address 
the general problem. This result is the foundation of the time series analysis methods using 
predictive inference and entropy including the adaptive time series methods. 

The original development of the canonical correlation analysis method of mathematical 
statistics was by Hotelling (1936; see also Anderson, 1958). The application of canonical variate 
analyse to stochastic realization theory and system identification was done in the pioneering work 
of Akaike (1974a, 1975, 1976). This initial work has a number of limitations such as no system 
inputs, no additive measurement noise, substantial computational burden involving numerous 
SVD's, a heuristic set of decisions for choosing a basis for representation of the system, and a 
number of approximations including computation of the A1C criterion for decision on model 
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order. 

Some important generalizations and improvements in Akaike's canonical correlation method 
have recently been made by Larimore (1983b). These include generalization to systems with addi¬ 
tive measurement noise and with inputs including feedback controls. A major departure of the 
approach from previous work is the use of a single canonical variate analysis to optimally choose 
4 linear combinations of the past for prediction of the future. The very natural measure of qua- 
dratically weighted prediction errors at possibly all future time steps is used. Formulated as such 
a prediction problem, it is shown how a generalized canonical variate analysis gives the solution 
explicitly. The interpretation of canonical variates as optimal predictors is central in motivating 
interest in such a problem formulation and is scarcely found in the statistical literature (Larimore, 
1986b). The optimal *-order predictors are not in general recursively computable, but the 
optimal state-space structure for approximating them is expressed simply in terms of the canonical 
variate analysis. The problem of finding an optimal Hankel norm reduced order model (Adam- 
jan et al, 1971; Kung and Lin, 1981) is related to the canonical variate approach (Camuto and 
Menga, 1982; Larimore, 1983b). The balanced realization method is a particular case of the gen¬ 
eralized canonical variate analysis (Desai and Pal, 1984). 

To more concisely discuss the canonical variate method, the results in Larimore (1983b, 
1986b) are briefly reviewed. Consider the problem of choosing an optimal system or model of 
specified order for use in predicting the future evolution of the process. As in Section 4 2, con¬ 
sider the past p, of the inputs u, and outputs y, before time t and the future of the outputs y, at 
time t or later so 

pj * ( ‘ * . fl ■ (K+irf+z* ■) (4-15) 

We assume that the procesres u, and y, are jointly stationary and denote the covariance matrices 
among/, and p, and V 

The major interest is in determining a specified number k of linear combinations of the past 
p, which allow optimal prediction of the future /,. The set of * linear combinations of the past 
p, are de n oted as a 4x1 vector m, and are considered as 4-order memory of the past. The 
optimal linear prediction /, of the future /,, which is a function of a reduced order memory m,, 
is measured in terms of the prediction error 

fill/, - All l f ) = E{(f, -fJZ} f (f, -/*,)> (4-16) 

where £ is the expectation operation and f denotes the pseudoinverse of a matrix. The optimal 
prediction problem is to determine an optimal 4-order memory 
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*» “ hP, (4-17) 

by chooring the k rows of J k such that the optimal linear predictor /,(«,) based on m, minimis 
the prediction error. 

As derived in Larimore (1986b), the solution to this problem in the completely general case 
where the matrices s //» and l /f may be singular is given by the generalized singular value 
decomposition as stated in the following theorem. 

Haemal 1. Consider the problem of choosing k linear combinations m, - J k p, of p, for 
predicting /, such that (4-16) is minimized where 2^ and Iff are posribty singular positive sem- 
idefinite symmetric matrices with ranks m and a respectively. Then the existence and uniqueness 
of solutions are completely characterized by the (2^ Xff >generalized singular value decomposi¬ 
tion which guarantees the existence of matrices 7, L, and generalized singular values yj, • ■ • ,y r 
such that 

m U . Li„L T . Jip/L 7 -Diag{y x *..*y r ,0.0) (4-18) 

The solution is given by choosing the rows of J k as the tint k rows of J if the 1-th singular value 
satisfies y k > 7 *+i- If there are r repeated singular values equal to y t , then there is an arbitrary 
selection from among the corresponding singular vectors, in. rows of J . The minimum value is 

min || II » (1 - 7?) + • • + (1 - it) (4-19) 

rmkV k l m Jfl • k *n 

This result not only gives a complete characterization of the solutions in selecting optimal 
predicton wi* from the past p, for prediction of the future /,, but the reduction in prediction 
error for all poeribte selections of order k is given simply in terms of the generalized singular 
values. This is of great importance since it avoids having to do a considerable amount of compu¬ 
tation to determine what selection of order is appropriate in a given problem. 

The generalbwd CVA method allows the determination of the tit of the various state space 
models and the selection of the best model state order before computation of the state space 
models. Consider the general case of identifying a state space model: given the past of the 
related random process u, and y,, we wish to model and predict the future of y, by a 4-order 
state-space structure of the form (4-12) and (4-13) 

x,- <bx,-K/u,+w, (4-20) 


y, - Hi, +Au, -t-ltw, +v, 


(4-21) 








Ia the computational probl em fives finite deu, the peat and future of the process are taken to be 
finite of length 4 left ao 

pj - (yl- 1 . • • • jJ*# t- 1 . • • • #t-d) • fJ * (y. r . • • ■ O',*) (4-22) 

Akaike (1976) propoaed chooring the number d of lags by learn square* autoregressive modeling 
using recursive learn squares algorithms and choosing the number of lags u that minimising the 
A1C criterion dhcuaaed below. This insures that a sufficient number of lags are used to capture 
all of the s ta t isticall y significant behavior in the data. This procedure is easily generalised to 
include the caw with inputs a,. The generalized SVD of Theorem 1 determines a transformation 
/ of the past that puts the state in a canonical form so that the memory m, » Jp, contains the 
statu ordered in terms of their importance in modeling the pro ces s. The optimal memory for a 
given order * then corresponds to selection of the first k statu. 

In order to decide on the model order to select, the Akaike information criterion (2-10) is 
computed where the number of parameters is determined from (4-16). Once the optimal k-order 
memory m, is determined, state-space equations of the form (4-12) and (4-13) for approximating 
the p ro ce ss evolution are easily computed by a simple multiple regremion procedure (Larimore. 
1963b). 

Since the CVA system identification procedure involves the state space model form, it has 
the major advantage that the model is globally identifiable so that the method is statistically well 
conditioned in contrast to ARMA modeling methods (Gevers and Wertz, 1982). Furthermore, 
since the computations are primarily a SVD, they are numerically stable and accurate with an 
upper bound on the required computations (Golub, 1969). Thus the method is completely reliable. 
It has been demonstrated u such in the time series analysis software Forecast Master that is com¬ 
mercially sold by SSI. From the theory of the CVA method (Larimore, Mahmood and Mehra, 
1964), it can be shown that there are no difficulties such as biased estimates caused by the pres¬ 
ence of a correlated f eed b a c k signal. The CVA method was demonstrated in real time identifica¬ 
tion and adaptive control of unstable aeroelastic wing flutter on a scale model F-16 aircraft in the 
NASA Langley Transonic Dynamics Wind Tunnel in February 1986. 










5. ADAPTIVE TIME SHOES ANALYSIS 


The state space model identification methods an developed in this chapter for adaptive time 
series analysis. The concepts of a changing Markov process are first discussed along with con¬ 
cepts of a piecewise constant model of the process that is constant over intervals of time. The 
approach to adaptation to dow changes using predictive inference and entropy is described. This 
leads to a model fitting criterion for choosing an optimal data interval that balances the decreas¬ 
ing sampling variability with increasing sample aim against the increasing misamodeling error due 
to use of a constant model over an interval of data. In fitting models involving abrupt changes, 
the models fitted over various intervals are compared to determine if an abrupt change has 
occurred. This involves the comparison of models determined from dam on different date inter¬ 
vals in predicting the error on a different interval. Several examples are given in using the pro¬ 
cedure an changes involving the dynamics, noise excitation, measurement noise, and ocher 
changes. 

Concepts of adaptive systems have been around since the 1950'i involving various senses of 
adaptation. The present literature on the subject includes a number of methods such as recursive 
computational schemes, exponential forgetting, lattice computational methods, etc., which have 
certain Ttnobs* that allow tuning of the algorithm to accommodate changes in the characteristics 
of the actual pro ces s es . Reviews of these and related methods are contained in several recent 
special issues of technical journals and books (Special Issue on Adaptive Control, Automatica, 
Voi. 20, No. S, 1985; Special Issue oa Linear Adaptive Filtering, IEEE Trans, on Information 
Theory, Vot. 30, No. 2, 1964; Honig and Messerschmitt, 1984). While these methods do permit 
some degree of adaptation to process changes, the methods of adaptation are ad hoc, and no 
sound underlying statistical principle for adaptation is propose d or demonstrated. As might be 
expected, these methods can work poorly on certain cases because of the lack of a sound statisti¬ 
cal basis. 

In particular, the recursive prediction error and lattice methods are convenient due to their 
recursive form and provide an estimate at every observation (Friedlander, 1982a, 1982b, 1983; 
Ljung and Soders tr o m , 1983). Also, the recursive algorithms can be used for adaptation by 
exponential weighting of the past data (Wellstead and Sanoff, 1981; Irving, 1979; Evans and Betz, 
1962). But the rational for exponential weighting has not been given a sound fundamental justifi¬ 
cation, but is used largely due to its ease of use. The choice of the exponential weight has been 
ad hoc and susceptible to misinterpretation of changing noise variance levels as time varying 
changes in the dynamics (Hagglund, 1983). 
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Adaptation to abrupt changes has been largely discussed in the fault detection literature. A 
comprehensive survey of fault detection methods is given by Willsky (1976). See also Mehra and 
Peschon (1971), Willksy and Jones (1974), Willsky (1980), and Isermann (1984). These methods 
have a number of short «vwming» in detecting changes in dynamics, computational illconditioa- 
ing, and excessive computational burden. 

The central computation of any adaptive algorithm involves the extension of methods for 
identification of stationary time series. There are several difficulties with currently available 
methods and software for the identification of system dynamics and noise characteristics. Current 
m ethods include the self tuning regulator (STR) (Ljung, 1983; Astrom, 1973; Astrorn et al, 1973, 
1977), mirimiim likelihood estimation (MLE) (Mehra and Tyler, 1973; Larimore, 1981a), the 
Box-Jenkins (BJ) method (Box and Jenkins, 1976), and a variety of heuristic approaches. The 
current state of the art in both MLE and BJ require that an analyst be involved in the procedure, 
and the required number of computational iterations is not bounded. The STR has been applied 
successfully to simple pro c e s s es , but is not completely reliable for general pro ces s es particularly 
when multi-input, multi-output systems are involved. In addition, the recursive prediction error 
algorithm used in the STR requires a good initial estimate and so is not suitable for short date 
where no apriori date is available. The heuristic approaches tend to be for special purposes and 
are rather unreliable in general applications. 


5.1 Medals far Chsagteg Prnri— 

The p robl em of modeling changing proces s es involves primarily two types of changes 

• changes that are slow compared to the date interval used for identif¬ 
ication 

• abrupt changes occurring infrequently compared to the date interval 
used for identification. 

If the changing process changes too rapidly or the abrupt changes occur too frequently relative to 
the date interval required for sufficiently accurate identification, then it is not possible to 
separate die actual system changes from the variability due to sampling. 

Consider a time varying Markov process where the conditional probability of the future 
given the past depends upon time r so 

(5-1) 

where the state, defined as linear combinations of the past p ,, varies with time as 

i, - C,p, (5-2) 













la order to upwa the «■— evolution of the lute of a Markov process in terms of a system of 
state space equations of the form 

*.*1 ■ ♦»*• ♦ c »“» + w t ( 5 - 3 ) 

y, • H,x, + A,*, + B,w, + v, (5-4) 

where the various matrices are time varying, it is necessary and sufficient that the conditional dis¬ 
tribution of the future /, conditioned on the state x, have the form 

P(f,\*n*j) m P(f,\*,- 1*1 *,9J) (5-5) 

This condition is essentially that the information in the state x, for predicting the future f, is con¬ 
tained in the state x f _ 1 delayed by one time step and the present inputs a, and outputs y t . This 
condition is satisfied by most physical systems since the memory is stored as energy in physically 
describable states. If the system changes abruptly, there may also be an abrupt change in the 
input or a large noise innovation v, associated with a significant change in the state of the system. 
Formulated in this way, it is apparent how the state space modeling methods are particularly use¬ 
ful. 

In any modeling method based upon a finite sample of date, only a finite number of param¬ 
eters can be determined which are much fewer in number than the number of data. Of the vari¬ 
ous poaubie methods for modeling, the simplest and least presumptive is the piecewise constant 
model which is constant over various intervals of data. Thus consider the model of the form (5-3) 
and (5-4) with piecewise constant parameters 0, 

" ♦<*« + c /“« + w » (5-6) 

>» +A,*, + B,w, + v, (5-7) 

The coefficient matrices Q t , G„ H,, A,, and B, are functions of the parameters 6, which are con¬ 
stant over an interval of time T, and change from one time interval to another. 

In the following sections, adaptive time series analysis methods are developed by considering 
various hypotheses concerning slow and abrupt changes. The predictive inference and entropy 
methods provide a means of objectively comparing the vast multitude of such hypotheses entailed 
in the adaptive time series analysis problem. 












The pnkiaa at wf f* to stow variations it primarily that at determining the length of 
wn iatesval to net in the time varying model (54) and (5-7). Contider the division of a taction 
at <«•— into 2* tubintervalt of length 2' templet were h and / are an integen. Then the variout 
hypothetee can be tuch at H,: divide the interval into tubintervalt of length 2 1 . For 

each tubinterval I Jt tot IX-?, tuppoee a state space model denoted M t it fitted uting the 
CVA method with AIC utad to select the best mode! state order. 

By successive appli c ation of the Markov property, the joint probability density of the obser¬ 
vations conditioned on the initial state it given by 

logp(y,. Y* | Y 0 ,9 h ) - XtoB’OOl rj-ifij) (54) 

>-» 

where &[ - (if, ... ,•£) it the parameter vector for the composite model consisting of all of the 
models over the 2* subintervals. This gives the log likelihood as the turn of the conditional log 
likelihoods on each subinterval. 

Consider the entropy measure of the composite model 8*. Using the asymptotic approxima¬ 
tion (3-5), the oegentropy is 

«<*,**) - -y- + *($,$*) - + *(*,•')] (5-9) 

1 /-I 1 

where 8* and 9 1 are understood to denote the parameters constrained under the respective 
hypotheses involving estimation of these models. Note that there is a tradeoff between the first 
term which increases with finer subdivisions of the data and the second term which decreases as 
the number of parameters is increased with finer subdivisions of the data. A minimum of the 
oege ntr opy defines the optimum subdivision of the data. 

To —the oegen trop y from the sample, the AIC is used. From the definition of AIC, 
the AIC co neap on d in g to (5-9) is given by 

.. 2 * 

AJC(Y ] . Y^#) - %AiC(Y t J) (5-10) 

/-t 

The optimal data length is chosen as the * mimmi«ng the above AIC. 

As an illustration of this scheme, it was applied to a problem in the 1978 Workshop on Spec¬ 
tral Estimation (Gerhardt, 1978) in estimating the instantaneous frequency of a sine wave with 
time varying frequency in the presence of interference and random noise. The data were 








generated by the equation 

y(t) - 1000 cos(a(t)) + 100 cos(b(t )) + *(/) (5-11) 

when the signal component has a time varying phase a(t) and the interference component has 
I*— 6(f) with the instantaneous frequencies given in Figure 2, 


\SIGN 


\ 


INTlLfEMCf 


ULiM til 0 24 0 32 0 40 0 41 0 56 0 64 

(SECONDS) 


— TIME 


Figure 2. Instantaneous Frequency of Signal (solid) and Interference (dashed). 


and where the 
estimate the im 


ence and noise. 


is uniformly distributed with -100<*(i)<100. The participants were told to 
nous frequency of the signal which was observed in the presence of interfer- 


Table 2 gives the per sample AJC corr e sponding to the identified state space models for 
each of the subintervals used which were of lengths 16, 32, 64, and 121 samples. Also given are 
the per s a mp le AJC's of the composite piecewise constant models for samples of 121. It is seen 
that the s ubin t e r val length producing the minimum average AJC for all of the data is 32 samples. 
This was then uaa as the optimal subinterval length for modeling the instantaneous frequency. 




















SAMPLE 

NUMBER OF POINTS IN DATA SUBINTERVAL 

TIMES 

16 

32 

64 

128 

22 

38 

12.40 


warn 


12.12 

12.15 

1 


1239 


MB 


70 

86 

102 

118 

134 

1166 

12.12 

1 


1237 




1264 

1252 



1253 




(1228) 1205 

(12.17) 1227 

(12.16) 1236 

(1236) 1236 

190 

166 

182 

198 

214 

230 

246 

1201 




1204 

1226 



1266 




1152 

1203 

1229 


1106 




11.98 

12.02 



11.47 




(1205) 1206 

(12.03) 11.79 

(12.17) 12.06 

(12.40) 12.40 

278 

20 A 

1203 




1225 

1102 



llfl 

1139 




XJfi 

1108 

11.43 

11.76 


342 

398 

XI A. 

1208 

mm 



1155 

■B 



12.08 




390 

(11.74) 1066 

(11.70) 1165 

(11.78) 11.79 

(11.75) 11.75 


ALL DATA 


12.02 


11.67 


12.04 


12.17 
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The **t«w*f of the instantaneous frequency wu chosen as the maximum of the spectrum 
obtained from the CVA model fitted to the data. The estimated instantaneous frequency is given 
in Table 3 along with the ocher three best solutions obtained by the other participants in the 
workshop. 


SAMPLE 

TIME 

TRUE 

WILEY AND 
CARMICHAEL 

WEINER 
ET AL 

ADAPTIVE 

CVA 

MAXIMUM 

ENTROPY 

32 

141.47 

141.47 

14139 

142.13 

141.7 

64 

145.73 

145.73 

14538 

145.13 

1463 

96 

15237 

15237 

152.69 

15232 

152.6 

128 

160.73 

160.73 

160.71 

16135 

1603 

160 

170 XX) 

170.00 

17032 

16930 

169.9 

192 

17937 

17937 

179.48 

178.40 

179.9 

224 

187.63 

187.63 

187.72 

188.06 

1883 

256 

19437 

19437 

19433 

193.90 

1943 

288 

19833 

19833 

19733 

19732 

198.0 

320 

200DO 

200.00 

199.72 

200.94 

2013 

352 

19833 

19833 

19739 

19835 

1983 

384 

19437 

19437 

193.93 

194.66 

194 X) 

416 

187.63 

187.63 

187.09 

187.99 

187.4 

448 

17937 

17937 

17837 

17932 

178.9 

480 

170.00 

170.00 

169.64 

17031 

170 X) 

512 

160.73 

160.73 

160.11 


164.0 


Table 3. Instantaneous Frequency Estimates. 

The best solution by Carmichael and Wiley (1978) uses a special zero crossing method that is 
applicable only to pure sine waves so that it will not generalize to more general spectra. 

The adaptive CVA method did about as well as the best of the methods other than Wiley 
and Carmichael, and much better than a lot of them. Note that the adaptive CVA approach 
makes no assumptions about the form of the spectrum or the character of the time variation. 
Also the adaptive CVA method is completely automatic, and in this example did not involve any 
considerations by an analysis to determine the choke or modification of the computations. As 
measured in terms o 4 the estimated instantaneous frequency, the method did very well. 


SJ Adapted— to Ah—pt Changes 

The primary problem in adaptive time series analysis is determining if and when an abrupt 
change has occurred. This problem reduces to comparing two intervals of data and determining 
if the same p roc e s s model is a better description of the observations than a different model for 
each interval of data. A complication of the problem is that the exact time is unknown and must ! 

be determined from the data. This involves the comparison of a multitude of hypotheses con¬ 
cerning the possible time of occurrence of the abrupt change. In addition, the ability to detect 

3 

1 









MUIllllllHIIHUiUnMVUn 


ivuwiwiminriwiiHiw mwxk ww 


fMIWRAMV- wmw uwnirwiMiniim 


5-8 

the abrupt chance depends upon the data length of the data intervals used. The best data length 
for detection depends upon the type of abrupt change since some changes affect the observations 
immediately and the effect decreases rapidly, while for other changes the effect on the observa¬ 
tions takes some time before it is apparent. Thus the consideration of different data lengths 
requires additional hypotheses to be considered and compared. The predictive inference and 
entropy methods give a sound basis for the comparison of the multitude of nonnested hypotheses. 


Consider the problem of determining if there is a change in the the pro ce ss between two dis¬ 
joint data subintervals. The detection problem considered is where the process is modeled as a 
slowly changing process using some efficient procedure such as given in the previous section. The 
notation of the previous sections will be used with subscript 1 or 2 corresponding to the data 
intervals Y x from the past slowly changing models and Y 2 from the new data that is to be com¬ 
pared for detection of an abrupt change. The subscripted parameters 6 , and 62 with or without 
other superscripts, hats, or tildes will denote models based upon the corresponding data interval. 
The data lengths of the two intervals Y x and Y 2 are generally different with the first data interval 
determined by the slow adaptation method and with the second set usually much shorter and of 
variable data length since the best data length for detection of abrupt changes is not known. For 
any selection of the two intervals, we wish to determine if there has been a significant departure 
in the p ro ces s characteristics between the two data sets. 

Ideally, the hypothesis ( 61 , 62 ) that the models are different over the two subintervals Y x and 
Y 2 verses the hypothesis 6 ; that the model is the same over the joint data set (Tj.Kj) would be 
compared. But this would involve a considerable number of comparisons. To avoid such numerous 
comparisons, consider the following approximation. Let data set Y x be choaen as the most recent 
optimal length interval preceding Y 2 with corresponding model 0j which provides a near optimal 
prior model Y t . To detect any abrupt changes in the system, consider the approximation of using 
the model 61 as an approximation to the joint model 6 *. 

As discussed in Section 43, consideration can be limited to conditional models given the 
P«* or equivalently the initial state at the beginning of the subinterval. Using such conditional 
models, the likelihood function reduces to 

p(ri,Yi*d-p(Yifidp(r 2 fid (5-12) 

The model on the first data set Y : is the same tot both the above hypothesis and the change 
hypothesis p(Y lt 9 x )p(Y 2 ,^ 2 ). The entropy measure is the difference of the above two log likeli¬ 
hoods which involves only the likelihoods on the second subinterval Y 2 so that 

*(Mi) - £[1°IP0'2,6 1 ) - \otp(Y 2 ,^)] (5-13) 






1 
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If the system in fact had an abrupt change between Y x and Y 2 , then since the data length K, is 
much longer than Y 2 , moat of the information in detecting the abrupt change is in the comparison 
of the two models and ®j on the data Y 2 . 

The problem is now to obtain an estimate of the entropy measure (4-13) from the observa¬ 
tional data. The observed log likelihood is used as an estimate of the entropy as in (3-7). The bias 
of this estimate of the entropy measure is 

E[l(YJ - l(Y$ - £[/(§) - /(Xj)] - £(/(§) - /(Xl)] 

= -dim(&) + R(6,YJ - R(e,Yi) (5-14) 

where the term dim(<&) in (3-8) is not present since the estimate 8 1 is a function only of the sam¬ 
ple Y x which is conditionally independent of the sample Y 2 . Thus an unbiased estimate of the 
difference of negentropies R(Q,Yj) - R(9,Y{) of the two models is 

l(Yd - KXt) + dim(&) (5-15) 

This gives a test for the occurrence of an abrupt change between the two data intervals. 
Depending upon the nature of the change and the process characteristics, the best detection 
interval will vary. Some changes give most of the information about the change over a short 
interval while others have a cumulative effect and require a long time interval to detect. 

Consider as an example of the procedure for detecting abrupt changes the ARMA(43) 
model (4-8). Three types of abrupt changes were simulated including an abrupt change in the 
dynamics, in the state, and in the variance of the excitation noise w,. The results of the pro¬ 
cedure for detecting abrupt changes for the case of no change and cases of a simulated abrupt 
change in the dynamics, the excitatioa noise, and the state are shown in Tables 4, 5, 6, and 7 
respectively. In each case, the entropy measure for detecting the abrupt change was computed 
over various intervals of data of lengths 50, 100, and 200 samples and the abrupt change occurred 
at the sample time 32S. In the case of no abrupt change, the entropy measure shown in Table 4 
is on the average 0.5186 with a standard deviation of 0.016. As shown in Table 5, the largest 
value of the entropy measure is in fact in the interval samples 300-350 containing the time of the 
abrupt change in dynamics. The following interval of samples 350-400 also indicates a large value 
of the entropy measure. The initial large value in interval 300-350 is due to a transient in the 
state as it settles to a new steady state variance which is largely complete in interval 350400. The 
entropy measure then persists at this value in following intervals. The abrupt change in noise 
variance shown in Table 6 has a different character. The negentropy changes abruptly in interval 
300-330 and remains at that value in succeeding intervals. With an abrupt change in the state 
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Table 4. Value of Per Sample A1C for Subintervals of the Sample with No Abrupt Change. 


SAMPLE 

NUMBER OF POINTS IN DATA SUBINTERVAL 


50 

100 

200 

2tfr 

250 

300 

350 

400 




0.495 

0.503 


0.513 

10214 

28220 

19.924 

11.628 


IMBM 




Table 5. Value of Per Sample A1C for Subintervals of the Sample with an Abrupt Change in 
Dynamics at Sample 325. 

shown in Table 7, the departure is largely confined to the interval 300-350 although the transient 
has not quite died out in the interval 350-400. 

In all cases there was no assumption as to the nature of the change, and the procedure 
works as well on state jumps, changes in noise variances, or other changes. Note that the charac¬ 
ter of abrupt change is quite different depending on the type of abrupt change that occurs. The 
best detection of abrupt changes can only be achieved by an adaptive procedure that considers 
the multitude of data intervals and selects a near optimal data length for detection. These initial 
results on the adaptive detection procedure demonstrate that it is very sensitive to a variety of 
different abrupt changes in the model. 
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SAMPLE 

TIMES 

NUMBER OF POINTS IN DATA SUBINTERVAL 

a 

100 


iM 

250 

300 

350 

400 




m 

0503 

0579 

0513 

0.653 

0.653 

0.654 


Table 6. Value of Per Sample AIC for Subintervals of the Sample with an Abrupt Change in Ex¬ 
citation Noise Variance at Sample 325. 


SAMPLE 

NUMBER OF POINTS IN DATA SUBINTERVAL 


IBS 

100 


2 «r 

250 

300 

350 

400 



_ 

0.495 

0503 


0513 

21.842 

85.766 

43.180 

0595 






Table 7. Value of Per Sample AIC for Subintervals of the Sample with an Abrupt Change in 
State at Sample 325. 







































6. SMALL SAMPLE MULTIVARIATE ANALYSIS 


The approach to «mall sample inference in this Chapter is the use of the entropy measure of 
model approximation error to evaluate the performance of small sample methods. This general 
approach is based on the justification of entropy as the natural measure of model approximation 
error as developed in Chapter 2. The historical Bayesian predictive inference approach plays a 
major role in providing a computable predictive density which is subsequently shown to be 
optimal in terms of the entropy measure. This optimality is established by considering best invari¬ 
ant predictive densities. For the multivariate normal family, several predictive densities are com¬ 
pared with the best invariant to show the large improvements that are possible in small samples. 


6.1 Bajdia Predictive Inference 

The historical approach to predictive inference involves the use of Bayesian concepts and 
methods to determine the predictive density. Consider the parameterized class of probability den¬ 
sity functions 


F = {p(y^| 8)K)€0} 


(6-1) 


defined on the joint sample space (X.K). The predictive inference setup as in Section 2.1 is con¬ 
sidered where a predictive density p a (y| x) is to be constructed as an approximation to the true 
conditional density p(y\ xfi) for the unknown parameter value 8. In the Bayesian approach, the 
predictive density is constructed on the basis of an assumed prior density p( 8) on the parameters 8 
using Bayes rule. 

From Bayes Theorem, the posterior density of the parameters is given by 


'<*'>-**$* 


(6-2) 


where the marginal density of x is 


p(x) ■ Jp(0)p(*I 8V® 


(6-3) 


The Bayesian predictive density p b {y\ x) is then given by computing the marginal density using 
the posterior so 


Pk(y \*) * JV(y!*.8>(8|*ye (64) 

This approach is direct and simple although the assumption of a prior density p( 8) on the 







parame ter! 9 is bothersome from both a theoretical and practical point of view. 

A major objection to the Bayesian approach is the use of an arbitrary prior density on the 
parameters to express ignorance. If a uniform density is used for the prior on 9, then a nonsingu¬ 
lar transformation of the perimeters to a new set ^ * g( 9) and use of a uniform prior on 6 in 
general produces a different posterior density. Thus there is a certain arbitrary choice of the 
parameterization and resulting posterior. A way around this is the use of nonurformativ* prior 
densities. Such densities give posterior densities that are invariant to transformations of the 
parameter space (Jeffreys, 1961; Box and Tiao, 1973). In situations where a noninformative prior 
exists, it can be obtained in terms of the Fisher Information matrix. 

In recent years, some intriguing connections between the bayesian predictive density and 
concepts of entropy, frequentist methods, invariant methods, and noninformative priors have 
been made. Still in a strictly Bayesian context, consider the negentropy measure (2-6) applied to 
the Bayesian predictive density (64). Of course a Bayesian would take expectation of this meas¬ 
ure with respect to the unknown parameters 9 which will be called the Bayes Bisk. Using (64) and 
interchan g in g the order of integration, the Bayes Risk between any two predictive densities 
Pi(y I x) andp 2 (yj x) is 

,.*Oi(yl -O ; Pi(y\ *)) * Jp(®)/p(*I ®)JVW *.») to * p^| dy 

, _ p,(y|x) 

= x) dy (6-5) 

Now setting p x (y\ x) « p b (y\ x) guarantees that the Bayes Risk between p b and any predictive 
density p 2 is nonnegative and zero if and only if p 2 (y\ x) * p b (y\ x). 

Thus in a Bayesian context, the Bayesian predictive density is optimal in terms of the Bayes 
Risk, i.e. the expected negative entropy measure with the expectation also taken over the param¬ 
eters 9. As was noted by Aitchison in the original derivation of this result for the case of x and y 
independent, there are a number of interesting cases where the negative entropy measure, ie. the 
Bayes Risk excluding expectation over the parameters 9, is not a function of the parameters 9. In 
such cases the Bayesian predictive density is optimal in a frequency sampling sense where there is 
a fixed unknown true parameter value and the negative entropy (2-6) is used as the measure of 
error. This topic is discussed in the next section. 





63 


6J Bat tawiiat Pndldin Dswltim 

In several particular cases, the optimal predictive density has been found that minimis the 
negative entropy. Murray (1977) considers the class of d-dimensional multivariate normal densi¬ 
ties N 4 (\i>£) with 

P(y\*2) » (2w)^| XI -^expf-jcy-^/rV*)} (6-6) 

In this case Aitchison and Ounsmore (1973, p. 29) show that using the nooinformative prior den¬ 
sity proportional to | £| _1 , the Bayesian predictive density is the d-dimensional Student distribu¬ 
tion 


P,(y\ x) - Ji - (a-l,il,(a+lX«-ir 1 i] 
where the d-dimensional vector z is St d (kJ>/:) if it has density function 


( 67 ) 


_ n(*+i>2> _ 

^ a T{(k-d +1>2}| *c| - M {l+{z-bf (kc)-\z-b)Kk +1>2 


This Bayes predictive density was shown (Aitchison, 1973) to result in the negative entropy not a 
function of the unknown parameter 6.. It is thus also optimal in the frequency sense. 

This same result was derived by Murray (1977) using invariance concepts. Consider the class 
G of invariant predictive densities p a (yj x) that are invariant to translations and linear transforma¬ 
tions of the sample x. In this class, the best invariant predictive density p/(yj x) was shown to be 
(67) which gives a constant value of the negative entropy independent of the value of the true 
parameter value 6.. This gives a strictly frequentist interpretation of the Bayes predictive density. 
A stronger result reported very recently (Levy and Perng, 1986) is the minimali ty the negentropy 
for the best invariant predictive density p/(y | x) uniformly in the unknown parameters 
6. * (p.,1.) among any predictive density in the class G of invariant predictive densities. 

6J Campari— of Entropy for Multivariate Normal 

To illustrate the usefulness of the predictive inference approach using negentropy, the 
results for the multivariate normal distribution are given below in terms of the relative odds of 
the likelihood ratio. Consider the case (6-6) of the multivariate normal density Nj(p,£). Here 
three methods are compared: the estimative method using the predictive density 
p*(y| x) « N 4 (y,iH*)£(x)) where the maximum likelihood estimates fijx) and £(x) are used, the 
best normal with estimates (|i,(*+ 1X"-*-2)‘ l X) which minimize the negentropy in the class of 
normal densities, and the Bayesian predictive density which is identical to the best invariant 
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predictive denary. 

The expected n egentr opies ere shown in Tsbie 8 for the above three predictive densities. 


Predictive 

Number of Observations (n) 

Densities 

4 

6 

11 

14 

20 

SO 

1-Dimensional 

Estimative 

1.191 

0366 

0.130 

0.094 

0.060 

0.021 


(2-48) 

(1.170) 

(1.037) 

(1.121) 

(1.009) 

(1.001) 

Bea 

0.476 

0226 

0.103 

0.079 

0.053 

0820 

Normal 

d-21) 

(1.048) 

(1.009) 

(1.006) 

(1.002) 

(1.000) 

Best 

0282 

0.180 

0.094 

0.083 

0.051 

0.020 

Invariant 







8-Dimensional 

Estimative 

- 

- 

3687 

8.08 

285 

0.61 




(w 14 ) 

(221.4) 

(3819) 

(1127) 

Best 

- 

. 

6.79 

3.15 

1.63 

030 

Normal 



(8.93) 

d-56) 

(1127) 

(1.010) 

Best 

- 

. 

4.60 

2.69 

131 

0.49 

Invariant 








Table 8. Expected Negative Entropy (and the Geometric Mean of the Likelihood Odds Relative 
to the Best Invariant). 


In compering two predictive distributions, the relevant quantity is the difference between their 
negentropies (2-5) (Larimore (1983a)). The exponential of this quantity is the geometric mean of 
the relative odds of a sample y having come from the two respective predictive distributions. 
This exponential of the negentropy difference is also given in parentheses in Table 8 for the best 
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normal and estimative methods relative to the best invariant method. Since ap{a-b) = l+(a-6) 
for o-h«l, we see that for negentropy differences much less than unity, the odds of an 
observed d -dimensional sample y coming from either of two predictive distributions is about 
equal. For the negentropy difference near unity, these odds are disproportionate of order e =2.7; 
and if it is much greater than unit the odds can get very large. Note that a 20 percent increase in 
the negentropy as between the estimative and best invariant for d =1 and n =20 has only a one 
percent odds advantage. On the other hand, a 17 percent increase in the negentropy as between 
the best normal and best invariant for d =8 and n =14 has an odds ratio of 136. This emphasizes 
the importance of comparing the negentropy on the basis of the arithmetic difference and not the 
relative proportion. Note that for very small samples the relative odds ratio can be much larger 
than unity and even in the tens or hundreds. Thus there is a huge potential gain in the use of 
predictive inference in very small samples as has been noted in different terms by Aitchison and 
Dunsmore (1975, p. 231), Aitchison and Kay (1975) and Murray (1979). 












7. CONCLUSIONS AND RECOMMENDATIONS 


7i CntW—i From Ph mm I Study 

In this Phase I SBER study, statistical methods are developed using predictive inference and 
entropy. This approach has a strong intuitive appeal as a result of the justification of the entropy 
measure based upon the predictive inference framework and the fundamental statistical principles 
of sufficiency and repeated sampling. This approach applies to a wide class of inference problems 
including: 

• general inference methods such as parametric or nonparametric 
methods 

• exact evaluation of small sample procedures 

a determination of model order or structure including the case of non¬ 
nested multiple comparison 

a time aeries analysis including definition of optimal tracking of time 
varying p roc e ss e s and optimal detection of abrupt changes. 

The entropy measure provides a fundamental measure for the comparison of alternative statistical 

procedures and provides a basis for developing optimal statistical inference methods. 

In this study a number of particular topics were addressed from the predictive inference and 
entropy perspective including: 

• statistical model building involving the determination of parametric 
model structure and order in the general case of constrained multi¬ 
ple nonnested alternatives, 

• time series modeling and forecasting involving the determination of 
parametric model structure and order, 

• adaptive time series analysis involving optimal methods for tracking 
slow changes as well as for detecting abrupt changes or failures, 

• small sample inference for multivariate distributions of the exponen¬ 
tial family. 

Some major results were developed on these topics that demonstrate the feasibility and desirabil¬ 
ity of developing statistical methods using predictive inference and entropy. 

A number of results were obtained for the nonnested multiple comparison problem based 
upon the study of constrained maximum likelihood estimates. These include: 






• consideration of the general constrained case 

• general extension of Akaike's A1C procedure to constrained non* 
nested multiple comparison problems 

• solution of the general constrained cue requires that a condition on 
the Fisher information and Hestian matrices be satisfied 

• a general model order and structure selection method was shown to 
be asymptotically optimal. 

Previous developments consider only the case where the true parameter is approached asymptoti¬ 
cally and exclude the case where the true parameter lies outside the models considered. The con¬ 
strained case investigated in this study gives a basis for viewing the predictive inference and 
entropy method u model approximation when the models are restricted and asymptotically 
biased. These results provide a basis for the use of predictive inference and entropy on the gen¬ 
eral time series analysis and adaptive time series analyse problems involving constrained non¬ 
nested multiple comparison. 

Using currently available methods, the time series analysis problem is difficult because the 
parametric model structure is unknown and requires the fitting and comparison of many different 
models. Also current methods are numerically and statistically ilkoaditioned for some models. 
The approach of predictive inference and entropy provides a natural solution to the multiple com¬ 
parison problem. The results obtained using the predictive inference and entropy approach for 
multivariate time series analysis include: 

a Explicit expressions for a lower bound on the achievable accuracy in 
the estimation of the transfer function and power spectral matrix 

a This lower bound applies to the case where the true model order is 
unknown and a model order determination procedure is used. 

a The lower bound is achieved for large samples using maximum likel¬ 
ihood estimation and the A1C order determination procedure. 

An example of the estimation accuracy of a true ARMA(44) process using spectral smooching, 
AR model, and ARMA model fitting show the considerable difference in using these various 
methods. 

Markov models of time series were developed as a basis for stable time series analysis 
methods using the canonical variate method (CVA). This method is numerically and statistically 
stable and has been applied recently to a number of high order multivariable time series analysis 
problems. This approach provides the basis for adaptive time series analysis methods. 

Markov models of time series with changing characteristics were developed including slowly 
and abruptly changing procesres. Using the CVA method as the computational method, the 
entropy methods were applied to adaptive time series analysis: 
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• oMthodi for determining the optimal data length for adap¬ 
tation to slow changes were developed. 

• S uti etical methods for choosing the optimal time interval for detec¬ 
tion of an abrupt change in the proc e ss were developed. 

• The entropy measure is optimally sensitive to any abrupt changes 

including in pr o ces s dynamics, changes in the excitation 

noise levels, and jumps in the process state. 

These results demonstrate the feasibility of developing adaptive time series analysis methods 
based upon entropy methods and the CVA computations. The CVA computations have been 
demonstrated in real time identification of multivariable systems. Thus the feasibility of adaptive 
tune series analysis in real time has been demonstrated. 

Small sample methods were developed using the predictive inference and entropy methods. 
The justification of entropy based upon the sufficiency and repeated sampling principles provides 
a sound justification for the use of recently developed small sample methods based upon entropy. 
The Bayesian method was extended to the case where the informative and predictive experiments 
are dependent. The theory is illustrated for the multivariate normal distribution using the Baye¬ 
sian, best invariant, estimative, and best normal predictive densities. The relative measure of 
approximation is shown to be the per sample relative odds ratio which is the exponential of the 
entropy measure. 


for Further R esea r ch and 


This study has demonstrated the feasibility and usefulness of predictive inference and 
entropy methods particularly in the areas of: 


• constrained nonnested multiple comparison of models 

• model order and structure determination for time series 


o modeling of changing processes using Markov model structures 

• optimal adaptation to slowly varying pro ces s es by optimal selection 
of data interval 

• optimal adaptation to abrupt changes of unknown type at unknown 
times by optimal selection of the detection data interval 

• automatic stable computation of time series models using the CVA 
method 


• determination of tower bounds for the estimation of transfer func¬ 
tions and power spectra 

These achievements provide a bases for the further research and development of predictive infer¬ 
ence and entropy methods. 




' %'aV- ■. .V, 


V \ A 
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The areas of greatest promise appear to be those of adaptive and aoaadaptive time series 
analysis for the following reasons: 

• the number of potential applications to DoD systems is very large 

e time series analysis and adaptation are the major problems in adap¬ 

tive control 

• to address the adaptive time series analysis problem requires an 
approach that deals with the multiple comparison problem in a fun¬ 
damental way that is offered by predictive inference and entropy 
methods 

e among the current time series analysis methods, only the CVA 
method is suitable for real time solution of the problem 

• present and near future computers are capable of multivariable iden¬ 
tification in less than a second of computation for high order systems 
of doaens of states 

These methods have been demonstrated to be feasible, and the development of online adaptive 
time series analysis software for general application would provide an enormous capability for 
DoD systems. Presently there are no other known approaches that will achieve this goal. Such 
adaptive time series pro ce ssor s would allow for the adaptation of systems to slow and abrupt 
changes in the environment. 

The topics recommended for further research and development include: 

• further research and development on the adaptive time series 
analysis methods for adaptation to slow and abrupt changes 

• development of algorithms for implementation of the adaptive 
methods that are numerically stable and accurate and are statisti¬ 
cally reliable 

• prototype algorithm testing to demonstrate the accuracy, reliability, 
and computational requirements on typical DoD problems. 

• s oft ware development to provide modular, documented, and verified 
software in one or more general programming languages. 

The achievement of these objectives would provide a dramatic improvement in the performance 
of adaptive methods and the availability of software for adaptation in DoD systems. 
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The general problem of choosing a model from among a multitude of alternative 
models remains one of the difficult problem of statistical inference. Traditional methods 
of statistical hypothesis testing apply directly only to the case where there are two 
hypotheses under consideration and one is a subset of the other, i.e., the two hypotheses 
are nested. In such cases, classical methods are applicable and lead to well understood 
results. In the case were there are more than two hypotheses involved, the use of the 
classical methods are not well defined or understood even in the nested case (see for 
example the discussion in Anderson, 1971, pp. 270). Although the probability of rejecting 
one hypothesis in comparing any pair is well defined, the repeated application of such 
pairwise comparisons results in a test whose properties are not understood. In the case of 
comparing two hypotheses that are not nested, the distribution theory is available but 
much more complicated (Larimore, 1977). 

Beyond these difficulties in carrying out the classical procedures is the lack of a gen¬ 
eral framework for formulating and solving the problem of nonnested multiple comparison 
of constrained models. The predictive inference approach offers a predictive measure of 
the accuracy of various model selection procedures that apply as easily to the case of non¬ 
nested multiple comparison. The adequacy of a model selection procedure is measured in 
terms of the accuracy of the selected models in hypothetical repeated 'future* experi¬ 
ments. This is very attractive in the context of scientific inference were the role of model 
building is to provide a basis for prediction of the future behavior of a phenomenon. The 
entropy measure provides a most sensitive measure based upon the sufficient statistic as 
contained in the likelihood ratio. The derivation of the entropy as a measure of the pred¬ 
iction error of a predictive density in the predictive inference framework is based upon 
the fundamental principles of sufficiency and repeated sampling (Larimore, 1983). This 
provides a strong theoretical basis for the use of entropy in predictive settings of scientific 
inference. In more narrowly defined problems of quality control or decision theory 
involving a well defined lorn function, ocher procedures may be more appropriate. But in 
a predictive scientific setting, the approach using predictive inference and entropy seems 
much more justified. 

Consider the problem of choosing among a multitude of model structures on the 
basis of a set of observations. U we adopt the predictive criterion that the chosen model 
should be the best in a predictive sense in predicting another independent sample from 








the «*«"* process, then the optimal choice is the model selection procedure with the 
minimum neg entr opy. The major problem is the practical evaluation of the negentropy 
measure and the determination of efficient procedures that come close to minimising the 
negentropy measure. 

In this paper, the theory of inference for nonnested multiple comparison is 
developed in the context of predictive inference and entropy. To develop a general 
theory, the case of maximum likelihood is considered for moderate and large samples. 
The case where the true process model is not contained in the models considered for 
inference is the usual situation in scientific inference since even the most general model 
forms usually do not include certain complexities such as nonlinearities, noostationarity, 
etc., that may have a small effect or be very difficult to handle. Previous approaches 
involving the entropy measure have not explicit included this misa-modeling. It is shown 
that this miss-modeling can be directly considered in the analysis. The resulting theory is 
very attractive in that is gives an explicit interpretation of the predictive inference 
approach as model approximation of the true process using simplified alternative model 
forms. The entropy methods lead to procedures that select models that in the predictive 
sense are the most accurate in approximating the true process model. The classical diffi¬ 
culties of nonnested and multiple comparison do not arise in this predictive inference set¬ 
ting. 

In the paper, first the subject of constrained maximum likelihood estimation is 
developed in the predictive inference and entropy context. This is used to derive the 
expected negentropy for maximum likelihood estimates, and then to determine an 
unbiased estimate of the entropy. Finally, bounds on the achievable accuracy of model 
selection procedures is derived that depend on the number of estimated parameters in the 
model fitting. 

2. Cowtraftaal Maximum tjwwmm»h Estimation 

In this section, properties of the maximum likelihood parameter estimates are 
developed for the case that the true probability model is not contained in the class of 
parameterized densities that are considered for inference. The classical development of 
the asymptotic consistency and minimum variance of maximum likelihood estimators is 
for the case where the true density is contained in the parametric class. 
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The predictive inference framework as in Larimore (1983) is adopted here with 
p(x,9) the parameterized probability density where 6 is a vector of parameters, x is the 
informative sample and y is the predictive sample. Suppose that the parameter vector 
is a finite or infinite set of parameters, and for each subset of distinct posi* 

tive integers k =(k l . k m ) consider the subspace 0 t of 0 such that only the correspond* 

ing 6 t| ,... .0* are nonzero where 0* denotes a member of 0*. and let C k be the the 
class of models C k =^p(x ,&),& €8 t }. These classes of models are in general nonnested so 
that we do not in general have C*CC, or C jCC k . The maximum likelihood estimator for 
the class C k will be denoted as 0*(x). 

The development of the maximum likelihood theory is straight forward for the case 
where Taylor series expansions are possible. This holds under the following regularity 
conditions (Cox and Hinkley, p. 281): 

(i) The parameter space is closed and compact. 

(ii) The probability distributions defined by any two different values of 0 are distinct. 

(iii) The first three derivatives of the log likelihood /(x,0) with respect to 0 exists in the 
neighborhood of the true parameter value almost surely. Further, in such a neigh¬ 
borhood, n -1 times the absolute value of the third derivative is bounded above by a 
function of x , whose expectation exists. 

In particular, these conditions permit the interchange of expectation and differentiation 
up to second order. 

In the discussion various order models are considered, and the relationships 
between the various orders is developed. The log likelihood function of the informative 
sample x will be denoted by /(x,0), and the gradient row vector and Hessian matrix 
denoted / '(x ,0) and /”(jc, 0) respectively. Expectation , denoted E, will be with respect to 
the true density p(x,9) unless stated otherwise where 0 denotes the true parameter value. 
Define the projection 0* of 0 onto 0 t as the parameters 0* £0* minimizing the negentropy 
R m relative to the informative sample x 

*,(0,0*) » El(x ,0) - El(x ,0*) (2-1) 

At the minimum 0 * , the gradient of (2-1) is zero so from the regularity conditions 


£/ (x,0*) -0, 


(2-2) 
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and the minimum is unique if and only if the expected Hessian, denoted £)*, of (2*1) is 
positive definite in an open neighborhood of the minimum. From the regularity condi¬ 
tions, die Hesoan is given by D*=£/''(x,8*). 

To deter min e the moments of the maximum likelihood estimates 0*. consider the 
first order equality 

o = /'(x.e*) = /'(x.e*) + (2-3) 

Taking expectation with respect to the the true density and using (2-2) gives the equation 

D*(£9*-0*) - 0 (2-4) 

that holds asymptotically for large informative sample N. For 0* identifiable, i.e. 9* 

unique, D\ is noosingular which implies that to first order 

£6* - 9* (2-5) 

Now using (2-3), the covariance of the estimation error is 

=(D t V£{/' 7 (x,9»V’(x,9*)KD 4 *)- 1 (2-6) 

Note that in the unconstrained case, the middle term which is the Fisher information 
matrix is equal to minus the expected Hessian £>*, but this is not in general true for the 
constrained case. 

3. Expected Negative Entropy for Maximum Likelihood Estimates 

To evaluate the expected negative entropy for maximum likelihood estimates, con¬ 
sider the predictive inference setup as in Larimore (1983). The general case of depen¬ 
dence between the informative and predictive samples is considered. The expected nega¬ 
tive entropy is a measure of the degree of approximation of the true conditional density 
P(y\*fi) by the predictive density p[y\ x,0*) for predicting the future predictive sample y 
from the informative sample x. The expectation will be taken in two steps, first with 
respect to the random variable y| x and then with respect to x In this section the likeli¬ 
hood function 1(9*) * /(y| x,9*(x)) is for the predictive sample y|x, and the maximum 
likelihood estimator 0*(x) is on the informative sample x. 

Expanding (2-1) in a Taylor series gives a second order expression for the informa¬ 
tion distance which holds asymptotically for large sample size of the informative sample, 
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i.c. for the i "‘ T ' mi| m likelihood estimate 0* close to the projection 0* 

,( M *(*)) * £[/(§*) - /(0*)1 + £[/(§) - /(§*)] 

- - 9*)] - - 0*)] + £[/(§) - /(0*)1 

* -\ E *W &(*)-** II i + £,!,(».«*) (3-1) 

since 0*(x) is independent of /(y| x,0*) and using the gradient property of the projection 
(2-2). The second order expansion is only locally in the estimation error 0*-0* about the 
projection 0* of the true parameter value 0 on the subspace of the parameters correspond¬ 
ing to the model O*. Note that this exprewon gives the exact bias term £^,(0,0*) in small 
samples and involves no approximation. 

4. Unbiased Estimation of Entropy 

For decision on model parametric order and structure, it is necessary to estimate 
the negative entropy based on the informative sample. One such procedure is due to 
Akaike (1973). We consider the case where the informative sample x and the predictive 

sample y are independent. For each selection of a parameter subset k={k x . k m ), the 

Akaike information criterion for comparing the m aximum likelihood estimators is 

AJC(k) = -21ogp(x,0*(x)) + 2 K(k) (4-1) 

where K(k) is the number of parameters, i.e. the dimension of 0*. The minimum AJC 
estimator (MA1CE), denoted 0 4 (x), is d A (x)**fr^\x) where k(x) is the parameter set 
minimizing AlC(k). The A1C(k) is an unbiased estimator of the negative entropy based 
upon the informative sample and the assumed model structure. The predictive sample is 
essentially replaced by the informative sample, and the term 2 K(k) is an adjustment for 
the bias due to the correlation between the informative sample x and the estimate 0*(x). 

Following Akaike, we use the maximized log likelihood /,(0*) - /(x,0*(x)) as an 
estimate of the relative entropy and compute the bias in the procedure. We expand the 
log likelihood function as in (3-1) except that below the likelihood is on the informative 
sample so that there is dependence between /(x,0*) and 0*(x). In particular, using (2-3) 
gives 
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- e - «*)] - £((«* - Vfr&yp - e*)] (4-2) 

Consider expending the expected log likelihood difference as in (3-1) but using (4-2) as a 
result of the dependence. Thus 

E[m - /(»*)] - E [/(§*) - /(**)] + £(/(§) - /(§*)] 

- -£[/'(e*X«‘ -•*)] -£[{(«*- -?)]+ £[/(«) - /(0*)] 

- £((«* - e*) r / *(e*Xe‘ - 6 *)] + *(0,0*) 

- -tr(D*r X E{l T(x ,9*V (x,§*)} + *(0,0*) (4-3) 

where the third equality follows using the expression (3-1) for the expected negentropy. 
In the general case, the bias term in the negentropy, *(0,0*), is correctly estimated except 
for the trace term. Asymptotically, if 0* approaches 6, then the matrix of the trace is the 
identity. This is the case considered by Akaike (1973). In the general constrained case, 
this may not be the case so that the bias depends upon the unknown true parameter value. 
What is required is a restriction such as the Hessian and Fisher information matrix being 
equal globally which gives 

-tr(Dfr'E{i r(x,0*y (*,§*)} = m t{k) = *(*) (4-4) 

Consider the case of fitting two models 0* and &, and consider the expected differ¬ 
ence of the maximized log likelihoods 

£[/(«*) - !&)) - £[/(§) - /(80) - £[/(§) - /(§*)] 

- +dun(e*)-</im(80 + *(0,0*)- *(0,0*) (4-5) 

Thus for relative comparisons among hypotheses based on a given sample, an unbiased 
estimate of twice the negentropy £[/(0) - / (0*)] is given by the Akaike information cri¬ 
terion. Note that the proof of this is much more general than that originally given by 
Akaike (1973) since it applies to the general case of comparisons of nonnested structures. 
Also, the true parameter 0 need not be contained in the structures being compared so 
long as the Fisher information matrix is a constant in a neighborhood including the true 
parameter and its projection onto the subspaces of these structures. 


'iS. ■ / m.' »y mj? ^ ^ 
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S. Bwadi m Ik* Accuracy of Modd Selection 

To obtain a correction for the bias in the sample log likelihood as an estimate of the 
entropy measure, the Fisher information must be constant or other restrictions are 
required. In this section the Fisher information is assumed to be constant in a neighbor* 
hood of the true parameter 0 containing the projection 0 *. Under these conditions a 
lower bound on the entropy measure is derived. From the above relationships and stan¬ 
dard arguments of asymptotic consistency, it follows (Cox and Hinkley, 1974, p. 292) that 
the constrained maximum likelihood estimate 0 * is consistent and the limit in probability 
is 0*. In addition the properties of the unconstrained maximum likelihood translate to the 
projection 0* since the Fisher information matrix is constant. 

Consider first the case of estimating the model 0 using the *-th order maximum 
likelihood estimator 0*. Then from (3-1) 

£( 0 , 0 *) * D*£{(0‘-0‘Xe*-» k ) r } + *(M*) 2 = + *(M*) (5-1) 

where the inequality follows from the Cramer-Rao lower bound 
£ I [(0*-0*) r (-D*X® l ~®*)] a tr! K{ki N =* K(k)/N where K(k) is the number of parameters 
estimated in the model 0* and N is the sample size. The last term in (3-1) is the bias in 
using too low an order in the model fitting, and the first term is the sampling variability 
apart from the bias. This bound K(k)/2S on the variability is achieved for an asymptoti¬ 
cally unbiased and efficient estimator for the class C k such as maximum likelihood. In 
particular, if the true model order is no greater than k , then 

Lim N [£(0,0*) - £(0,0* )] a j (5-2) 

The true order k of the process is usually not known and may in fact be infinite, and the 
bias term in (5-1) is not known so that the above discussion is not very useful in practice 
although it dose give some insight into the accuracy issue. 

Consider now the Akaike MAICE procedure using the estimator 0 A . Assume that 
the true model order is infinite, so that for any / there exists a 7 as/' such that 0 jX). 
Define the optimal predictive order k'{N) depending upon the sample size N as the order 
k minimizing the negentropy (3-1). Then under suitable assumptions, the remarkable 
result is obtained by Shibata (1981a, 1981b, 1983) that asymptotically as N-*> 






(i) the lower bound using any order selection scheme is for each N given by evaluating 
(M) at *-*•(*), and 

(ii) this lower bound is achieved by the MAICE estimator & A . 

Thus the lower bound on the negentropy which is achieved by MAICE is equal to the 

negentropy that would result from using an efficient estimator with apriori knowledge of 

the optimal order k*(N). 
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1 . INTRODUCTION 


The problem of determining the achievable accuracy in identifying a model for a station¬ 
ary multiple time series is considered in this paper. The cases of the presence or absense of an 
exogeneous input or additive measurement noise are included. Consider the general case where 
x(r) is the exogeneous input and y(t) is the observed endogeneous output of a system which may 
include other unknown excitations and measurment noise. Thus consider the jointly stationary 
gaussian vector time series x(t) and >(/), t =...,-1,0,1,..., with power cross-spectral matricies 
S a (w, 0 ), $,,(<i», 0 ), £„(<■>, 0 ) parameterized by 0 , and denote the power cross-spectral matrix of the 
joint vector (x T (t), y T (t)f as 5(<n,0). 

Statistical inference is considered on a class of linear Gaussian processes parameterized by 
0 . Specifying a parametric model for the conditional process y(t), t&s given x(t), t<s implies a 
causal linear model of the form 

y (0 = < 7(0 + J*(r-r^)x(r) = q(t) + r(t) ( 1 ) 

r»O 

where h(t;d) is a causal linear system giving the response r(i) in y(t) due to the past exogeneous 
input x(r) and where q(t) is the error in predicting y(t) by r(t). From linear prediction theory 
(Gikhman and Skorokhod, 1969), the transfer function of h(t ; 0 ) is //(<*> ; 0 ) =S^(u> , 0 ), and 
the error q(t) in predicting y(t) is uncorrelated with r(t) with power spectrum S w (w;0)=sS w (<i>,«j 
- H («>,0). Note that any class of parameterized models S(<i>,0) can be equivalently 
specified by the parameterized models (5 W (<a,0), ff(u>,0)) which will prove more convienent. 

2. ENTROPY AND SPECTRAL ACCURACY 

Consider the following predictive inference setting (Larimore, 1983) involving an observed 
informative sample u T =(x T (l)j T (l),...j T (N)y T (N)) of size /V used to estimate the process model, 
and similarly consider a conceptual predictive sample v of size Af used to evaluate the accuracy of 
the estimated model. The predictive sample is assumed to be identically distributed but indepen¬ 
dent of the informative sample. Consider the problem of inference on the parametric class 
{p (v , 0 ), 0 € 0 } of models with probability densities p(v , 0 ) based upon the informative sample u . 
Consider the conceptual repeated sampling experiment where on each trial the samples u and v 
are each drawn independently from the process 5(u,0.) with 0. assumed to be the true parameter 
value. An estimative model p=p(v M u )) is chosen for the density of v by some parameter estima¬ 
tion scheme 0(u) . The negative entropy, also known as the expected Kullback-Leibler discrimina¬ 
tion information or expected I-divergence, is a measure the error in approximating the true 
density p. of v by the estimate p and is given by 
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- p(v,0.) 

R(p,j) = E u K(p,j>) =£„ J p{y ,0«) log- ■ dv (2) 

p(v-,e(M)) 

where E m denotes expectation relative to u and K denotes the Kullback-Leibler information. The 
negative entropy measure follows as the natural measure in the predictive inference setting from 
the fundamental principles of sufficiency and repeated sampling (Larimore, 1983). This 
approach applies to very general modeling methods such as nonparametric, semi parametric or 
parametric procedures as well as methods including decisions on model structure or order such as 
those used for AR and ARMA modeling. 

Let lower case variables denote a sample of size M of the predictive sample, eg. 
y =(y T (l)j T (2),...j T (Mjf and denote the covariance matrix of y. By expressing the density 

P(yj fi) ~ p(y~r fi) p(* fl) in terms of the conditional random process q(t) = y(t)-r(f), 

p(y * .8) * p(y I * ,0) p(x ,8) = p{y - r(x ,0),0) p(x ,0) = p{q\r(x ,0),0) p{x ,0) (3) 

the log likelihood separates with the density of i(t) in many problems not a function of the unk¬ 
nown parameters or at least a function of a separate set of parameters. A conditional viewpoint 
is taken in the following where only the conditional term p(q\ r(x,0),0) with x considered as non¬ 
random is considered. The depencence of r on x will be understood in the notation. Inclusion of 
the second term is tantamount to modeling the joint vector time series involving the two series 
x(i) and y(t) jointly rather than as exogeneous and endogeneous respectively. The joint case is 
included as a special case of y 1 (/) = (y T (t)j T (t )) a vector process with no input x(t) which will 
be discussed as a particular instance of the model throughout the paper. The I-divergence (2) con¬ 
ditional on x thus becomes 


K(p*j) = f p(<l\x. 9.) log 


P(<l\x,9.) 

P(<l\x,9) 


dq = E\og>(y-r,X„) - E\ogp(y-f S„) 


” Elogp(y-r.,2„) - E\ofp((y-r.)+{r.-f)X„) 


■ Elogpiy-r.X,") - E[ofp((y~r,)S„) - E(r.-ff t^(r.-r) (4) 

where t -r(x,0), r. »r(x,0.), and where E denotes expectation with respect to the density 

p(y i *,••)• 

For brevity set 5 denote the true spectrum, and let 5 denote an estimate of S. We will 
need to asume that S(<*») is continuous and that 5 w (w) and S w («) are positive definite for 
. In the discussion, the predictive sample v will be considered to be conditional on x(r) 
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and to have an infinite sample size M This will require the normalization of the negative 
entropy and 1-divergence by the sample size M The I-divergence per sample time conditional on 
x(r), whi ch will be denoted l(Sj) and called I-divergence for brevity, can be expressed using (4) 
as (Kazakos and Papantoni-Kazakos, 1980) 

t(Sj) - 

- - J /<**#! * tr[l - 

j ftr<S H l lHM - H(»))S a (»MM - (5) 

where the subscript emphasues that the sample of size M of v becomes infinite. The negative 
entropy per sample, or ncgentropy for brevity, is defined as N(S J)=iim-^-R(p.j}) -E m l(sj). 

Note that the I-divergence is composed of two terms, the last due to the error in estimating the 
transfer function //(<*) and the first due to the error in estimating the spectrum S w (w) of the 
noise q(») A useful approximation for the first term in (5) is 

- } Si l °g\ ‘(«)l + tr[I - 

-w 

35 («) 

which holds to second order in the elements of S n as is easily shown by comparing first and 
second derivatives of the integrands. This is a generalization to the multivariate case of the 
integral of the squared relative error. Thus the I-divergence is approximately a quadratic form in 
the estimation errors of S w («) and H( w), and these quadratic forms do not interact, in. there 
are no crom terms. 

3. NORMALIZED SPECTRAL ERROR IN PRINCIPAL COMPONENTS 

In the multple time series case, the spectral measure (5) has an intuitive interpretation in 
terms of principal components of the power spectrum in the frequency domain. Principal com¬ 
ponent representations of the spectral tnatricies 5 W («) and S a (<*>) have the form 


/(«) J„(u)y*(<-) - D(u>) , L(w)S a (»)L m (») »£(«) 


(7) 










where J(m) and L(m) given as a function of frequency u> are unitary matrix transformations which 
diagonalia* S a (m) and S^(m) respectively so that J (u»y*(w) * / = £,(«)£,*(«), and where D( u) 
and £(«) are diagonal matriciet. 

Using spectral factorization theory, the matrix function /(<■>) can be chosen as a continu¬ 
ous function of u> and the transfer function of a causal filter under either of the mild assumptions 

(i) The process is purely nonde termini Stic ((Gikhman and Skorokhod (1969), Whittle(1954) for 
the scalar random field case). 

(ii) The autocovariance function is absolute summable (Goodman and Ekstrom (1980) for the 
scalar random field case). 

These derivations of the spectral factoriation for the scalar case generalize to the multivariable 
case with care given to determining the logarithm of a matrix (Larimore, 1984, 1977). Otthonor- 
maliation of the rows of the spectral factor gives /(u) while the normalizing terms form the diag¬ 
onal of D( w) . Similarly, L(w) can be taken as a spectral factor of a causal filter. 

Let Jf(e) be the random Fourier coefficients of x(r), in. the spectral random measure of 
x (‘) ■ Filtering x(i) with transfer function £.(«) gives the principal component process x(r) which 
is expressed in the frequency domain as X(<a) * L (u)X (w) , and which has the diagonal spectral 
matrix £(<■») and similariy for q(t). 

Now consider the asymptotically equivalent expression (6) for the first term of the spectral 
measure (5) which is invariant to the unitary transformation J(m) and is thus given by 

» 1 V ** I v } I *></(«)!* du> 

~ 4 , i D u (») f 4w 2 ~l m D a ( ! rn)D j/ (u) 4w 

5 i - . i v ^ 

4 Ti D,(m) r 4w 4(«)D ; /«)4 w () 

where the approximation holds for the diagonal elements of D(u) near D(u). The approximation 
is very useful when only the estimated spectrum is known and we wish to coasider the 

error in estimating the truth . The first sum on the right hand side is the integrated 

squared relative error of the estimated cospectra of the principal components, while the second 
term is the integrated squared coherency of the estimated spectrum D(«) which would be zero if 
D(m)-D(h) . Thus the first term of the measure (5) has a clear interpretation in the multivariate 
cue when the true spectrum D(w) is diagonal but where the approximating spectrum £>(<■») is 









permitted 


arbitrary coherency among components. The general case is reduced to this diagonal case by 
chooting an appropriate filter /(<■>) which diagonalizes 5 W («) as in (7). 

The second term in the spectral measure (5) is invariant to the unitary transformations 
J(u) and L(m) which gives 

- { /MD-'MIGM - CMIEMOM - C(„)r>£ 

- - {/XI <!*<•> - G V MI (9) 

L -+IJ C JJ 

where C («)-7 («y/ («)£.’ (w) is the transfer function H (w) expresmd in the coordinate frame of 
the principal component series x(t) and y(t). The squared magnitude error | C (; («) - C /y («)| 2 in 
the ij element of the transfer function is weighted by the input signal to output noise ratio 
D u E }j for the pair «j. 

4. A LOWER BOUND ON ACHIEVABLE SPECTRAL ACCURACY 

A lower bound on the expected negative entropy gives an asymptotic lower bound on the 
achievable accuracy in the estimation of the process spectrum and transfer function. This applies 
to the case of a fixed known model order as well as to the case of an unknown or infinite model 
order with the use of the AIC for model order selection. The achievable spectral accuracy is 
given as a function of the sample size and the number of parameters estimated. 

Consider the case where the model S is estimated using a K dimensional constrained estima¬ 
tor 0*. As derived in Larimore (1986) using the Cramer-Rao lower bound, the expected negen- 
tropy is asymptotically bounded by 

+E u I(sJ) (10) 

where S is the model to which the constrained maximum likelihood estimate converges so that 
£ a /(5 J) is the bias in the constrained model and KITH is the sampling variability. 

The bound K/2N is achieved for an asymptotically unbiased and efficient estimator such as 
m axi m u m likelihood. In particular, this assumes that the true model order is no greater than the 
order K used in the model fitting. The true order K of the process is usually not known and may 










inf act be infinite, so that the above discussion is not very useful in practice although it does give 
some insight into the accuracy issue. 

Consider now the Akaike minimum AIC procedure (MAICE) using the estimator e A 
(Akaike, 1973). Assume that the true model order is infinite, so that for any subset of the infinite 
parameter vector, there exist nonzero components. Thus it is not possible to obtain asymptotically 
unbiased estimates of 0 using a fixed model order in estimating a model. Following Shibata 
(1983, 1981a, 1981b), define the optimal predictive order K'(N) depending upon the sample size 
N as the order K. minimizing the negentropy (10). Then under suitable assumptions, the remark¬ 
able result is obtained by Shibata (1981) that asymptotically as W-® 

(i) the lower bound using any order selection scheme is given by evaluating (10) at K=K', and 

(ii) this lower bound is achieved by the MAICE estimator & A . 

Thus the lower bound on the negentopy which is achieved by MAICE is equal to the negentropy 
that would result from using an efficient estimator with apriori knowledge of the optimal order 
K\N). 

Using' the spectral expression (5), an asymptotic lower bound on the expectation of the gen¬ 
eralized relative squared error in estimating the power spectrum is given by 

s i £. j,(j;V)(s,(.) - 

+ } £. />rfi w -'(«( u ) - (»){«(«•) - (11) 

This gives a fundamental limit to the achievable accuracy in any parametric estimation pro¬ 
cedure. A further perspective on this fact is given by the justification of the expected negentropy 
as the natural measure of modeling approximation error in statistical inference. 
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